[jira] [Created] (SPARK-17356) Out of memory when calling TreeNode.toJSON

2016-08-31 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-17356:
--

 Summary: Out of memory when calling TreeNode.toJSON
 Key: SPARK-17356
 URL: https://issues.apache.org/jira/browse/SPARK-17356
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Sean Zhong


When using MLlib, TreeNode.toJSON may cause an out-of-memory exception, with a 
stack trace like this:
{code}
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.mutable.AbstractSeq.<init>(Seq.scala:47)
at scala.collection.mutable.AbstractBuffer.<init>(Buffer.scala:48)
at scala.collection.mutable.ListBuffer.<init>(ListBuffer.scala:46)
at scala.collection.immutable.List$.newBuilder(List.scala:396)
at 
scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64)
at 
scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105)
at 
scala.collection.TraversableLike$class.filter(TraversableLike.scala:262)
at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
at 
scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274)
at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
at 
org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
at 
com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
at 
com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)
at 
com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338)
at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34)
at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:566){code}
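
For context, {{TreeNode.toJSON}} renders the entire tree, including any embedded literal values, into a single in-memory JSON string, so large plans can exhaust the heap. A minimal Scala sketch of the kind of call that reaches this code path, assuming a Spark 2.x session named {{spark}} (illustrative only):

{code}
// Hedged illustration: toJSON serializes the whole analyzed plan (a TreeNode)
// into one JSON string that is held entirely in memory.
val df = spark.range(0, 10).toDF("id")
val planJson: String = df.queryExecution.analyzed.toJSON
println(planJson.length)
{code}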








[jira] [Assigned] (SPARK-17355) Work around exception thrown by HiveResultSetMetaData.isSigned

2016-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17355:


Assignee: Apache Spark  (was: Josh Rosen)

> Work around exception thrown by HiveResultSetMetaData.isSigned
> --
>
> Key: SPARK-17355
> URL: https://issues.apache.org/jira/browse/SPARK-17355
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> Attempting to use Spark SQL's JDBC data source against the Hive ThriftServer 
> results in a {{java.sql.SQLException: Method not supported}} exception from 
> {{org.apache.hive.jdbc.HiveResultSetMetaData.isSigned}}. Here are two user 
> reports of this issue:
> - 
> https://stackoverflow.com/questions/34067686/spark-1-5-1-not-working-with-hive-jdbc-1-2-0
> - https://stackoverflow.com/questions/32195946/method-not-supported-in-spark
> I have filed HIVE-14684 to attempt to fix this in Hive by implementing the 
> {{isSigned}} method, but in the meantime / for compatibility with older JDBC 
> drivers I think we should add special-case error handling to work around this 
> bug.






[jira] [Assigned] (SPARK-17355) Work around exception thrown by HiveResultSetMetaData.isSigned

2016-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17355:


Assignee: Josh Rosen  (was: Apache Spark)

> Work around exception thrown by HiveResultSetMetaData.isSigned
> --
>
> Key: SPARK-17355
> URL: https://issues.apache.org/jira/browse/SPARK-17355
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Attempting to use Spark SQL's JDBC data source against the Hive ThriftServer 
> results in a {{java.sql.SQLException: Method not supported}} exception from 
> {{org.apache.hive.jdbc.HiveResultSetMetaData.isSigned}}. Here are two user 
> reports of this issue:
> - 
> https://stackoverflow.com/questions/34067686/spark-1-5-1-not-working-with-hive-jdbc-1-2-0
> - https://stackoverflow.com/questions/32195946/method-not-supported-in-spark
> I have filed HIVE-14684 to attempt to fix this in Hive by implementing the 
> {{isSigned}} method, but in the meantime / for compatibility with older JDBC 
> drivers I think we should add special-case error handling to work around this 
> bug.






[jira] [Commented] (SPARK-17355) Work around exception thrown by HiveResultSetMetaData.isSigned

2016-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454417#comment-15454417
 ] 

Apache Spark commented on SPARK-17355:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14911

> Work around exception thrown by HiveResultSetMetaData.isSigned
> --
>
> Key: SPARK-17355
> URL: https://issues.apache.org/jira/browse/SPARK-17355
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Attempting to use Spark SQL's JDBC data source against the Hive ThriftServer 
> results in a {{java.sql.SQLException: Method not supported}} exception from 
> {{org.apache.hive.jdbc.HiveResultSetMetaData.isSigned}}. Here are two user 
> reports of this issue:
> - 
> https://stackoverflow.com/questions/34067686/spark-1-5-1-not-working-with-hive-jdbc-1-2-0
> - https://stackoverflow.com/questions/32195946/method-not-supported-in-spark
> I have filed HIVE-14684 to attempt to fix this in Hive by implementing the 
> {{isSigned}} method, but in the meantime / for compatibility with older JDBC 
> drivers I think we should add special-case error handling to work around this 
> bug.






[jira] [Commented] (SPARK-17271) Planner adds un-necessary Sort even if child ordering is semantically same as required ordering

2016-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454405#comment-15454405
 ] 

Apache Spark commented on SPARK-17271:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/14910

> Planner adds un-necessary Sort even if child ordering is semantically same as 
> required ordering
> ---
>
> Key: SPARK-17271
> URL: https://issues.apache.org/jira/browse/SPARK-17271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 2.1.0
>
>
> Found a case where the planner adds an unneeded SORT operation due to a bug in 
> the way the comparison for `SortOrder` is done at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L253
> `SortOrder` needs to be compared semantically because the `Expression` within two 
> `SortOrder` instances can be "semantically equal" but not literally equal objects.
> eg. In case of `sql("SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1")`
> Expression in required SortOrder:
> {code}
>   AttributeReference(
> name = "col1",
> dataType = LongType,
> nullable = false
>   ) (exprId = exprId,
> qualifier = Some("a")
>   )
> {code}
> Expression in child SortOrder:
> {code}
>   AttributeReference(
> name = "col1",
> dataType = LongType,
> nullable = false
>   ) (exprId = exprId)
> {code}
> Notice that the output column has a qualifier while the child attribute does 
> not, but the underlying expression is the same; hence, in this case, the child 
> satisfies the required sort order.
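
A hedged Scala sketch of the comparison idea, illustrative only and not the actual patch: compare the expressions inside each {{SortOrder}} with {{semanticEquals}} instead of plain object equality.

{code}
import org.apache.spark.sql.catalyst.expressions.SortOrder

// Illustrative helper: the child ordering satisfies the required ordering if each
// required SortOrder has a semantically equal counterpart, prefix-wise.
def orderingSatisfies(required: Seq[SortOrder], child: Seq[SortOrder]): Boolean = {
  if (required.isEmpty) {
    true
  } else if (child.length < required.length) {
    false
  } else {
    required.zip(child).forall { case (req, actual) =>
      req.direction == actual.direction && req.child.semanticEquals(actual.child)
    }
  }
}
{code}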






[jira] [Created] (SPARK-17355) Work around exception thrown by HiveResultSetMetaData.isSigned

2016-08-31 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-17355:
--

 Summary: Work around exception thrown by 
HiveResultSetMetaData.isSigned
 Key: SPARK-17355
 URL: https://issues.apache.org/jira/browse/SPARK-17355
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen


Attempting to use Spark SQL's JDBC data source against the Hive ThriftServer 
results in a {{java.sql.SQLException: Method not supported}} exception from 
{{org.apache.hive.jdbc.HiveResultSetMetaData.isSigned}}. Here are two user 
reports of this issue:

- 
https://stackoverflow.com/questions/34067686/spark-1-5-1-not-working-with-hive-jdbc-1-2-0
- https://stackoverflow.com/questions/32195946/method-not-supported-in-spark

I have filed HIVE-14684 to attempt to fix this in Hive by implementing the 
{{isSigned}} method, but in the meantime / for compatibility with older JDBC 
drivers I think we should add special-case error handling to work around this 
bug.
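
A minimal Scala sketch of the kind of special-case handling proposed here, illustrative only and not the actual patch (the exact exception-message check is an assumption):

{code}
import java.sql.{ResultSetMetaData, SQLException}

// Hedged sketch: treat Hive's "Method not supported" answer from isSigned as
// "signed" instead of letting the exception abort schema resolution.
def isSignedSafe(metadata: ResultSetMetaData, column: Int): Boolean = {
  try {
    metadata.isSigned(column)
  } catch {
    case e: SQLException if e.getMessage != null && e.getMessage.contains("Method not supported") =>
      true // assume signed when the driver cannot answer
  }
}
{code}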






[jira] [Updated] (SPARK-17354) java.lang.ClassCastException: java.lang.Integer cannot be cast to java.sql.Date

2016-08-31 Thread Amit Baghel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-17354:

Description: 
A Hive database has one table with a column of type Date. Running a select query 
using Spark 2.0.0 SQL and calling the show() function on the DataFrame throws a 
ClassCastException. The same code works fine on Spark 1.6.2. Please see the 
sample code below.

{code}
import java.util.Calendar
val now = Calendar.getInstance().getTime()
case class Order(id : Int, customer : String, city : String, pdate : 
java.sql.Date)
val orders = Seq(
  Order(1, "John S", "San Mateo", new java.sql.Date(now.getTime)),
  Order(2, "John D", "Redwood City", new java.sql.Date(now.getTime))
  )   
orders.toDF.createOrReplaceTempView("orders1")

spark.sql("CREATE TABLE IF NOT EXISTS order(id INT, customer String,city 
String)PARTITIONED BY (pdate DATE)STORED AS PARQUETFILE")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("INSERT INTO TABLE order PARTITION(pdate) SELECT * FROM orders1")
spark.sql("SELECT * FROM order").show()
{code}  

Exception details

{code}
16/09/01 10:30:07 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 6)
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.sql.Date
at 
org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:89)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:185)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:204)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:362)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:339)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code} 

Expected output 

{code} 
+---+++--+
| id|customer|city| pdate|
+---+++--+
|  1|  John S|   San Mateo|2016-09-01|
|  2|  John D|Redwood City|2016-09-01|
+---+++--+
{code} 

Workaround for Spark 2.0.0

Setting enableVectorizedReader=false before calling the show() method on the 
DataFrame returns the expected result.
{code} 
spark.sql("set spark.sql.parquet.enableVectorizedReader=false")
{code} 





  was:
Hive database has one table with column type Date. While running select query 
using Spark 2.0.0 SQL and calling show() function on DF throws 
ClassCastException. Same code is working fine on Spark 1.6.2. Please see the 
sample code below.

{code}
import java.util.Calendar
val now = Calendar.getInstance().getTime()
case class Order(id : Int, customer : String, city : String, pdate : 
java.sql.Date)
val orders = Seq(
  Order(1, "John S", "San Mateo", new java.sql.Date(now.getTime)),
  Order(2, "John D", "Redwood City", new java.sql.Date(now.getTime))
  )   
orders.toDF.createOrReplaceTempView("orders1")

spark.sql("CREATE TABLE IF NOT EXISTS order(id INT, customer String,city 
String)PARTITIONED BY (pdate 

[jira] [Updated] (SPARK-17354) java.lang.ClassCastException: java.lang.Integer cannot be cast to java.sql.Date

2016-08-31 Thread Amit Baghel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-17354:

Description: 
A Hive database has one table with a column of type Date. Running a select query 
using Spark 2.0.0 SQL and calling the show() function on the DataFrame throws a 
ClassCastException. The same code works fine on Spark 1.6.2. Please see the 
sample code below.

{code}
import java.util.Calendar
val now = Calendar.getInstance().getTime()
case class Order(id : Int, customer : String, city : String, pdate : 
java.sql.Date)
val orders = Seq(
  Order(1, "John S", "San Mateo", new java.sql.Date(now.getTime)),
  Order(2, "John D", "Redwood City", new java.sql.Date(now.getTime))
  )   
orders.toDF.createOrReplaceTempView("orders1")

spark.sql("CREATE TABLE IF NOT EXISTS order(id INT, customer String,city 
String)PARTITIONED BY (pdate DATE)STORED AS PARQUETFILE")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("INSERT INTO TABLE order PARTITION(pdate) SELECT * FROM orders1")
spark.sql("SELECT * FROM order").show()
{code}  

Exception details

{code}
16/09/01 10:30:07 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 6)
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.sql.Date
at 
org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:89)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:185)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:204)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:362)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:339)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code} 

Expected output 

{code} 
+---+++--+
| id|customer|city| pdate|
+---+++--+
|  1|  John S|   San Mateo|2016-09-01|
|  2|  John D|Redwood City|2016-09-01|
+---+++--+
{code} 

Workaround for Spark 2.0.0

Setting enableVectorizedReader=false before calling the show() method on the 
DataFrame returns the expected result.

{code} 
spark.sql("set spark.sql.parquet.enableVectorizedReader=false")
{code} 





  was:
Hive database has one table with column type Date. While running select query 
using Spark 2.0.0 SQL and calling show() function on DF throws 
ClassCastException. Same code is working fine on Spark 1.6.2. Please see the 
sample code below.

{code}

import java.util.Calendar
val now = Calendar.getInstance().getTime()
case class Order(id : Int, customer : String, city : String, pdate : 
java.sql.Date)
val orders = Seq(
  Order(1, "John S", "San Mateo", new java.sql.Date(now.getTime)),
  Order(2, "John D", "Redwood City", new java.sql.Date(now.getTime))
  )   
orders.toDF.createOrReplaceTempView("orders1")

spark.sql("CREATE TABLE IF NOT EXISTS order(id INT, customer String,city 
String)PARTITIONED BY (pdate 

[jira] [Created] (SPARK-17354) java.lang.ClassCastException: java.lang.Integer cannot be cast to java.sql.Date

2016-08-31 Thread Amit Baghel (JIRA)
Amit Baghel created SPARK-17354:
---

 Summary: java.lang.ClassCastException: java.lang.Integer cannot be 
cast to java.sql.Date
 Key: SPARK-17354
 URL: https://issues.apache.org/jira/browse/SPARK-17354
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Amit Baghel
Priority: Minor


A Hive database has one table with a column of type Date. Running a select query 
using Spark 2.0.0 SQL and calling the show() function on the DataFrame throws a 
ClassCastException. The same code works fine on Spark 1.6.2. Please see the 
sample code below.

{code}

import java.util.Calendar
val now = Calendar.getInstance().getTime()
case class Order(id : Int, customer : String, city : String, pdate : 
java.sql.Date)
val orders = Seq(
  Order(1, "John S", "San Mateo", new java.sql.Date(now.getTime)),
  Order(2, "John D", "Redwood City", new java.sql.Date(now.getTime))
  )   
orders.toDF.createOrReplaceTempView("orders1")

spark.sql("CREATE TABLE IF NOT EXISTS order(id INT, customer String,city 
String)PARTITIONED BY (pdate DATE)STORED AS PARQUETFILE")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("INSERT INTO TABLE order PARTITION(pdate) SELECT * FROM orders1")
spark.sql("SELECT * FROM order").show()

{code}  

Exception details

{code}
 
16/09/01 10:30:07 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 6)
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.sql.Date
at 
org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:89)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:185)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:204)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:362)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:339)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

{code} 

Expected output 

{code} 
+---+++--+
| id|customer|city| pdate|
+---+++--+
|  1|  John S|   San Mateo|2016-09-01|
|  2|  John D|Redwood City|2016-09-01|
+---+++--+

{code} 

Workaround for Spark 2.0.0

Setting enableVectorizedReader=false before calling the show() method on the 
DataFrame returns the expected result.

{code} 

spark.sql("set spark.sql.parquet.enableVectorizedReader=false")

{code} 
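
For reference, the same workaround can also be applied through the session conf API instead of a SQL SET statement (equivalent setting, shown as a sketch):

{code}
// Disable the vectorized Parquet reader for this session, then re-run the query.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.sql("SELECT * FROM order").show()
{code}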










[jira] [Resolved] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978

2016-08-31 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17289.
-
Resolution: Invalid

The original PR that caused the bug has been reverted.

> Sort based partial aggregation breaks due to SPARK-12978
> 
>
> Key: SPARK-17289
> URL: https://issues.apache.org/jira/browse/SPARK-17289
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>Assignee: Takeshi Yamamuro
>Priority: Blocker
> Fix For: 2.1.0
>
>
> For the following query:
> {code}
> val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", 
> "b").createOrReplaceTempView("t2")
> spark.sql("select max(b) from t2 group by a").explain(true)
> {code}
> Now, the SortAggregator won't insert a Sort operator before partial 
> aggregation, which breaks sort-based partial aggregation.
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- LocalTableScan [a#5, b#6]
> {code}
> In Spark 2.0 branch, the plan is:
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- *Sort [a#5 ASC], false, 0
> +- LocalTableScan [a#5, b#6]
> {code}
> This is related to SPARK-12978






[jira] [Reopened] (SPARK-17289) Sort based partial aggregation breaks due to SPARK-12978

2016-08-31 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-17289:
-

> Sort based partial aggregation breaks due to SPARK-12978
> 
>
> Key: SPARK-17289
> URL: https://issues.apache.org/jira/browse/SPARK-17289
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Zhong
>Assignee: Takeshi Yamamuro
>Priority: Blocker
> Fix For: 2.1.0
>
>
> For the following query:
> {code}
> val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", 
> "b").createOrReplaceTempView("t2")
> spark.sql("select max(b) from t2 group by a").explain(true)
> {code}
> Now, the SortAggregator won't insert a Sort operator before partial 
> aggregation, which breaks sort-based partial aggregation.
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- LocalTableScan [a#5, b#6]
> {code}
> In Spark 2.0 branch, the plan is:
> {code}
> == Physical Plan ==
> SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
> +- *Sort [a#5 ASC], false, 0
>+- Exchange hashpartitioning(a#5, 200)
>   +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, 
> max#19])
>  +- *Sort [a#5 ASC], false, 0
> +- LocalTableScan [a#5, b#6]
> {code}
> This is related to SPARK-12978






[jira] [Reopened] (SPARK-12978) Skip unnecessary final group-by when input data already clustered with group-by keys

2016-08-31 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-12978:
-

> Skip unnecessary final group-by when input data already clustered with 
> group-by keys
> 
>
> Key: SPARK-12978
> URL: https://issues.apache.org/jira/browse/SPARK-12978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
> Fix For: 2.1.0
>
>
> This ticket targets an optimization that skips the unnecessary group-by 
> operation shown below.
> Without opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
>  output=[col0#159,sum#200,sum#201,count#202L])
>+- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
> InMemoryRelation [col0#159,col1#160,col2#161], true, 1, 
> StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
> {code}
> With opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
> [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
> true, 1), ConvertToUnsafe, None
> {code}
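
A hedged Scala sketch of a query shape this optimization targets, with illustrative column names: the input is already repartitioned (clustered) by the group-by key, so the final aggregation stage should be skippable.

{code}
import org.apache.spark.sql.functions.{avg, col, sum}

// Illustration only: pre-partition by the grouping key, then aggregate and
// inspect the physical plan for the one-stage vs. two-stage aggregate shape.
val df = spark.range(0, 1000)
  .selectExpr("id % 10 AS col0", "id AS col1", "id * 2 AS col2")
df.repartition(col("col0"))
  .groupBy("col0")
  .agg(sum("col1"), avg("col2"))
  .explain()
{code}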






[jira] [Closed] (SPARK-17343) Prerequisites for Kafka 0.8 support in Structured Streaming

2016-08-31 Thread Frederick Reiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frederick Reiss closed SPARK-17343.
---
Resolution: Won't Fix

There will be no Kafka 0.8 connectors for Structured Streaming. See comments on 
SPARK-15406.

> Prerequisites for Kafka 0.8 support in Structured Streaming
> ---
>
> Key: SPARK-17343
> URL: https://issues.apache.org/jira/browse/SPARK-17343
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Frederick Reiss
>
> This issue covers any API changes, refactoring, and utility classes/methods 
> that are necessary to make it possible to implement support for Kafka 0.8 
> sources and sinks in Structured Streaming.
> From a quick glance, it looks like some refactoring of the existing state 
> storage mechanism in the Kafka 0.8 DStream may suffice. But there might be 
> some additional groundwork in other areas.






[jira] [Closed] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-08-31 Thread Frederick Reiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frederick Reiss closed SPARK-17344.
---
Resolution: Won't Fix

There will be no Kafka 0.8 connectors for Structured Streaming. See comments on 
SPARK-15406.

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.






[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-08-31 Thread Frederick Reiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454320#comment-15454320
 ] 

Frederick Reiss commented on SPARK-15406:
-

Thanks for clearing that up. Marking SPARK-17343 and SPARK-17344 as "won't fix".

> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index






[jira] [Resolved] (SPARK-17241) SparkR spark.glm should have configurable regularization parameter

2016-08-31 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-17241.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14856
[https://github.com/apache/spark/pull/14856]

> SparkR spark.glm should have configurable regularization parameter
> --
>
> Key: SPARK-17241
> URL: https://issues.apache.org/jira/browse/SPARK-17241
> Project: Spark
>  Issue Type: Improvement
>Reporter: Junyang Qian
> Fix For: 2.1.0
>
>
> Spark has a configurable L2 regularization parameter for generalized linear 
> regression. It is very important to have it in SparkR so that users can run 
> ridge regression.
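
For context, a short Scala sketch of the underlying ML knob that this change surfaces in SparkR ({{regParam}} on the generalized linear regression estimator); the parameter values are illustrative:

{code}
import org.apache.spark.ml.regression.GeneralizedLinearRegression

// A non-zero regParam adds L2 (ridge-style) regularization for the gaussian/identity case.
val glr = new GeneralizedLinearRegression()
  .setFamily("gaussian")
  .setLink("identity")
  .setRegParam(0.3)
{code}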






[jira] [Updated] (SPARK-17241) SparkR spark.glm should have configurable regularization parameter

2016-08-31 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-17241:
--
Assignee: Xin Ren

> SparkR spark.glm should have configurable regularization parameter
> --
>
> Key: SPARK-17241
> URL: https://issues.apache.org/jira/browse/SPARK-17241
> Project: Spark
>  Issue Type: Improvement
>Reporter: Junyang Qian
>Assignee: Xin Ren
> Fix For: 2.1.0
>
>
> Spark has a configurable L2 regularization parameter for generalized linear 
> regression. It is very important to have it in SparkR so that users can run 
> ridge regression.






[jira] [Updated] (SPARK-17319) Move addJar from HiveSessionState to HiveSharedState

2016-08-31 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17319:

Description: 
The added jar is shared by all the sessions, because SparkContext does not 
support sessions. Thus, it makes more sense to move {{addJar}} from 
`HiveSessionState` to `HiveSharedState`.

This is also another step to remove Hive client usage in `HiveSessionState`.

Different sessions are sharing the same class loader, and thus, 
`metadataHive.addJar(path)` basically loads the JARs for all the sessions. 
Thus, no impact is made if we move `addJar` from `HiveSessionState` to 
`HiveExternalCatalog`.

  was:
aThe added jar is shared by all the sessions, because SparkContext does not 
support sessions. Thus, it makes more sense to move {addJar} from 
`HiveSessionState` to `HiveSharedState`

This is also another step to remove Hive client usage in `HiveSessionState`.

Different sessions are sharing the same class loader, and thus, 
`metadataHive.addJar(path)` basically loads the JARs for all the sessions. 
Thus, no impact is made if we move `addJar` from `HiveSessionState` to 
`HiveExternalCatalog`.


> Move addJar from HiveSessionState to HiveSharedState
> 
>
> Key: SPARK-17319
> URL: https://issues.apache.org/jira/browse/SPARK-17319
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> The added jar is shared by all the sessions, because SparkContext does not 
> support sessions. Thus, it makes more sense to move {{addJar}} from 
> `HiveSessionState` to `HiveSharedState`.
> This is also another step to remove Hive client usage in `HiveSessionState`.
> Different sessions are sharing the same class loader, and thus, 
> `metadataHive.addJar(path)` basically loads the JARs for all the sessions. 
> Thus, no impact is made if we move `addJar` from `HiveSessionState` to 
> `HiveExternalCatalog`.






[jira] [Updated] (SPARK-17319) Move addJar from HiveSessionState to HiveSharedState

2016-08-31 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17319:

Description: 
The added jar is shared by all the sessions, because SparkContext does not 
support sessions. Thus, it makes more sense to move {addJar} from 
`HiveSessionState` to `HiveSharedState`.

This is also another step to remove Hive client usage in `HiveSessionState`.

Different sessions are sharing the same class loader, and thus, 
`metadataHive.addJar(path)` basically loads the JARs for all the sessions. 
Thus, no impact is made if we move `addJar` from `HiveSessionState` to 
`HiveExternalCatalog`.

  was:
This is another step to remove Hive client usage in `HiveSessionState`.

Different sessions are sharing the same class loader, and thus, 
`metadataHive.addJar(path)` basically loads the JARs for all the sessions. 
Thus, no impact is made if we move `addJar` from `HiveSessionState` to 
`HiveExternalCatalog`.


> Move addJar from HiveSessionState to HiveSharedState
> 
>
> Key: SPARK-17319
> URL: https://issues.apache.org/jira/browse/SPARK-17319
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> The added jar is shared by all the sessions, because SparkContext does not 
> support sessions. Thus, it makes more sense to move {addJar} from 
> `HiveSessionState` to `HiveSharedState`.
> This is also another step to remove Hive client usage in `HiveSessionState`.
> Different sessions are sharing the same class loader, and thus, 
> `metadataHive.addJar(path)` basically loads the JARs for all the sessions. 
> Thus, no impact is made if we move `addJar` from `HiveSessionState` to 
> `HiveExternalCatalog`.






[jira] [Updated] (SPARK-17319) Move addJar from HiveSessionState to HiveSharedState

2016-08-31 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17319:

Summary: Move addJar from HiveSessionState to HiveSharedState  (was: Move 
addJar from HiveSessionState to HiveExternalCatalog)

> Move addJar from HiveSessionState to HiveSharedState
> 
>
> Key: SPARK-17319
> URL: https://issues.apache.org/jira/browse/SPARK-17319
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> This is another step to remove Hive client usage in `HiveSessionState`.
> Different sessions are sharing the same class loader, and thus, 
> `metadataHive.addJar(path)` basically loads the JARs for all the sessions. 
> Thus, no impact is made if we move `addJar` from `HiveSessionState` to 
> `HiveExternalCatalog`.






[jira] [Assigned] (SPARK-17353) CREATE TABLE LIKE statements when Source is a VIEW

2016-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17353:


Assignee: Apache Spark

> CREATE TABLE LIKE statements when Source is a VIEW
> --
>
> Key: SPARK-17353
> URL: https://issues.apache.org/jira/browse/SPARK-17353
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> - Add support for temporary views
> - When the source table is a `VIEW`, the metadata of the generated table 
> contains the original view text and view original text. So far, this does not 
> break anything, but it could cause problems in Hive. (For example, 
> https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1405-L1406)
> - When the type of the source table is a view, the target table uses the 
> default format of data source tables: `spark.sql.sources.default`.






[jira] [Commented] (SPARK-17353) CREATE TABLE LIKE statements when Source is a VIEW

2016-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454174#comment-15454174
 ] 

Apache Spark commented on SPARK-17353:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14531

> CREATE TABLE LIKE statements when Source is a VIEW
> --
>
> Key: SPARK-17353
> URL: https://issues.apache.org/jira/browse/SPARK-17353
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiao Li
>
> - Add support for temporary views
> - When the source table is a `VIEW`, the metadata of the generated table 
> contains the original view text and view original text. So far, this does not 
> break anything, but it could cause problems in Hive. (For example, 
> https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1405-L1406)
> - When the type of the source table is a view, the target table uses the 
> default format of data source tables: `spark.sql.sources.default`.






[jira] [Assigned] (SPARK-17353) CREATE TABLE LIKE statements when Source is a VIEW

2016-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17353:


Assignee: (was: Apache Spark)

> CREATE TABLE LIKE statements when Source is a VIEW
> --
>
> Key: SPARK-17353
> URL: https://issues.apache.org/jira/browse/SPARK-17353
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiao Li
>
> - Add support for temporary views
> - When the source table is a `VIEW`, the metadata of the generated table 
> contains the original view text and view original text. So far, this does not 
> break anything, but it could cause problems in Hive. (For example, 
> https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1405-L1406)
> - When the type of the source table is a view, the target table uses the 
> default format of data source tables: `spark.sql.sources.default`.






[jira] [Created] (SPARK-17353) CREATE TABLE LIKE statements when Source is a VIEW

2016-08-31 Thread Xiao Li (JIRA)
Xiao Li created SPARK-17353:
---

 Summary: CREATE TABLE LIKE statements when Source is a VIEW
 Key: SPARK-17353
 URL: https://issues.apache.org/jira/browse/SPARK-17353
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: Xiao Li


- Add support for temporary views

- When the source table is a `VIEW`, the metadata of the generated table 
contains the original view text and view original text. So far, this does not 
break anything, but it could cause problems in Hive. (For example, 
https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1405-L1406)

- When the type of the source table is a view, the target table uses the 
default format of data source tables: `spark.sql.sources.default`.
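
A hedged sketch of the statements this issue concerns (object names are illustrative; behavior before the fix may differ):

{code}
// Create a temporary view, then use it as the source of CREATE TABLE LIKE.
spark.sql("CREATE OR REPLACE TEMPORARY VIEW src_view AS SELECT 1 AS id, 'a' AS name")
spark.sql("CREATE TABLE dst_table LIKE src_view")
// The target should be a data source table using spark.sql.sources.default.
spark.sql("DESCRIBE EXTENDED dst_table").show(truncate = false)
{code}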








[jira] [Updated] (SPARK-16942) CREATE TABLE LIKE generates External table when source table is an External Hive Serde table

2016-08-31 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-16942:

Description: 
When the table type of the source table is an EXTERNAL Hive serde table, {{CREATE 
TABLE LIKE}} will generate an EXTERNAL table. The expected table type should be 
MANAGED.

The table type of the generated table is `EXTERNAL` when the source table is an 
external Hive Serde table. Currently, we explicitly set it to `MANAGED`, but 
Hive is checking the table property `EXTERNAL` to decide whether the table is 
`EXTERNAL` or not. (See 
https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1407-L1408)
 Thus, the created table is still `EXTERNAL`. 


  was:When the table type of source table is an EXTERNAL Hive serde table, 
{{CREATE TABLE LIKE}} will generate an EXTERNAL table. The expected table type 
should be MANAGED


> CREATE TABLE LIKE generates External table when source table is an External 
> Hive Serde table
> 
>
> Key: SPARK-16942
> URL: https://issues.apache.org/jira/browse/SPARK-16942
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> When the table type of the source table is an EXTERNAL Hive serde table, {{CREATE 
> TABLE LIKE}} will generate an EXTERNAL table. The expected table type should 
> be MANAGED.
> The table type of the generated table is `EXTERNAL` when the source table is 
> an external Hive Serde table. Currently, we explicitly set it to `MANAGED`, 
> but Hive is checking the table property `EXTERNAL` to decide whether the 
> table is `EXTERNAL` or not. (See 
> https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1407-L1408)
>  Thus, the created table is still `EXTERNAL`. 
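
A hedged Scala sketch of the reported scenario, assuming a Hive-enabled session; table names and the location are illustrative:

{code}
// Source: an EXTERNAL Hive SerDe table.
spark.sql(
  """CREATE EXTERNAL TABLE src_ext (id INT, name STRING)
    |STORED AS TEXTFILE
    |LOCATION '/tmp/src_ext'""".stripMargin)

// Target created via CREATE TABLE LIKE; it is expected to be MANAGED,
// but as described above it comes back as EXTERNAL.
spark.sql("CREATE TABLE dst_copy LIKE src_ext")
spark.sql("DESCRIBE FORMATTED dst_copy").show(truncate = false)
{code}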






[jira] [Commented] (SPARK-17211) Broadcast join produces incorrect results on EMR with large driver memory

2016-08-31 Thread gurmukh singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454147#comment-15454147
 ] 

gurmukh singh commented on SPARK-17211:
---

Yes, outside of EMR.

On a node you will have the OS, NodeManager, and DataNode daemons using memory.

One other thing we might need to look at is Java 1.8. I have used Java 1.8 with 
Apache Spark 2.0 for tests.

> Broadcast join produces incorrect results on EMR with large driver memory
> -
>
> Key: SPARK-17211
> URL: https://issues.apache.org/jira/browse/SPARK-17211
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Jarno Seppanen
>
> Broadcast join produces incorrect columns in join result, see below for an 
> example. The same join but without using broadcast gives the correct columns.
> Running PySpark on YARN on Amazon EMR 5.0.0.
> {noformat}
> import pyspark.sql.functions as func
> keys = [
> (5400, 0),
> (5401, 1),
> (5402, 2),
> ]
> keys_df = spark.createDataFrame(keys, ['key_id', 'value']).coalesce(1)
> keys_df.show()
> # ++-+
> # |  key_id|value|
> # ++-+
> # |5400|0|
> # |5401|1|
> # |5402|2|
> # ++-+
> data = [
> (5402,1),
> (5400,2),
> (5401,3),
> ]
> data_df = spark.createDataFrame(data, ['key_id', 'foo'])
> data_df.show()
> # ++---+  
> 
> # |  key_id|foo|
> # ++---+
> # |5402|  1|
> # |5400|  2|
> # |5401|  3|
> # ++---+
> ### INCORRECT ###
> data_df.join(func.broadcast(keys_df), 'key_id').show()
> # ++---++ 
> 
> # |  key_id|foo|   value|
> # ++---++
> # |5402|  1|5402|
> # |5400|  2|5400|
> # |5401|  3|5401|
> # ++---++
> ### CORRECT ###
> data_df.join(keys_df, 'key_id').show()
> # ++---+-+
> # |  key_id|foo|value|
> # ++---+-+
> # |5400|  2|0|
> # |5401|  3|1|
> # |5402|  1|2|
> # ++---+-+
> {noformat}






[jira] [Comment Edited] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-08-31 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453789#comment-15453789
 ] 

Cody Koeninger edited comment on SPARK-15406 at 9/1/16 2:26 AM:


There's a big difference between continuing to publish an existing
implementation that requires little to no maintenance, and putting new
development work into a dead version.

I don't think removing existing functionality for Kafka 0.8 users makes
sense. Nor do I think new development should be chained to old versions.

We already had this argument around the Kafka connector, and I believe it
was resolved in a sensible way - keep the old existing code around, but
stop putting effort into it.



was (Author: c...@koeninger.org):
There's a big difference between continuing to publish an existing
implementation that is requires little to no maintenance, and putting new
development work into a dead version.

I don't think removing existing functionality for Kafka 0.8 users makes
sense. Nor do I think new development should be chained to old versions.

We already had this argument around the Kafka connector, and I believe it
was resolved in a sensible way - keep the old existing code around, but
stop putting effort into it.

On Aug 31, 2016 18:45, "Frederick Reiss (JIRA)"  wrote:


[ https://issues.apache.org/jira/browse/SPARK-15406?page=
com.atlassian.jira.plugin.system.issuetabpanels:comment-
tabpanel=15453749#comment-15453749 ]

Frederick Reiss commented on SPARK-15406:
-

WRT Kafka 0.8: I'm under the impression that there is a significant number
of Spark users who are still stuck with Kafka 0.8 or 0.9. Kafka sits
between multiple systems, so upgrades to production Kafka installations can
be challenging. Are other people monitoring this JIRA seeing a similar
situation, or are versions before 0.10 not in widespread use any more? If
older Kafka releases aren't relevant, then we should probably deprecate the
entire spark-streaming-kafka-0-8 component.

feel like time based indexing would make for a much better interface, but
it's been pushed back to kafka 0.10.1
33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index






[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-08-31 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15454043#comment-15454043
 ] 

Reynold Xin commented on SPARK-15406:
-

+1.

> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index






[jira] [Assigned] (SPARK-17352) Executor computing time can be negative-number because of calculation error

2016-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17352:


Assignee: (was: Apache Spark)

> Executor computing time can be negative-number because of calculation error
> ---
>
> Key: SPARK-17352
> URL: https://issues.apache.org/jira/browse/SPARK-17352
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> In StagePage, the executor computing time is calculated, but a calculation error 
> can potentially occur because it is computed by subtracting floating-point numbers.






[jira] [Commented] (SPARK-17352) Executor computing time can be negative-number because of calculation error

2016-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453984#comment-15453984
 ] 

Apache Spark commented on SPARK-17352:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/14908

> Executor computing time can be negative-number because of calculation error
> ---
>
> Key: SPARK-17352
> URL: https://issues.apache.org/jira/browse/SPARK-17352
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> In StagePage, the executor computing time is calculated, but a calculation error 
> can potentially occur because it is computed by subtracting floating-point numbers.






[jira] [Assigned] (SPARK-17352) Executor computing time can be negative-number because of calculation error

2016-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17352:


Assignee: Apache Spark

> Executor computing time can be negative-number because of calculation error
> ---
>
> Key: SPARK-17352
> URL: https://issues.apache.org/jira/browse/SPARK-17352
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> In StagePage, the executor computing time is calculated by subtracting 
> floating-point numbers, so rounding error can potentially make the result 
> negative.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-08-31 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453980#comment-15453980
 ] 

Yun Ni commented on SPARK-5992:
---

Thanks very much for reviewing, Joseph!

Based on your comments, I have made the following major changes:
(1) Changed the API design to use Estimator instead of UnaryTransformer 
(sketched below).
(2) Added a Testing section to elaborate on the design of the unit tests and 
scale tests.

Please let me know if you have any further comments, so that we can go forward.
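
For readers following along, a minimal sketch of what moving from 
UnaryTransformer to an Estimator/Model pair looks like in spark.ml is below. 
All class names and method bodies here are illustrative placeholders, not the 
API proposed in the design doc.

{code}
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// Illustrative skeleton only: an Estimator that would learn hash functions in
// fit() and return a Model that appends hash values in transform().
class SketchLSH(override val uid: String) extends Estimator[SketchLSHModel] {
  def this() = this(Identifiable.randomUID("sketchLSH"))

  override def fit(dataset: Dataset[_]): SketchLSHModel = {
    // A real implementation would sample random projections / hash functions here.
    copyValues(new SketchLSHModel(uid).setParent(this))
  }

  override def copy(extra: ParamMap): SketchLSH = defaultCopy(extra)
  override def transformSchema(schema: StructType): StructType = schema
}

class SketchLSHModel(override val uid: String) extends Model[SketchLSHModel] {

  override def transform(dataset: Dataset[_]): DataFrame = {
    // A real implementation would append a column of hash buckets here.
    dataset.toDF()
  }

  override def copy(extra: ParamMap): SketchLSHModel = defaultCopy(extra)
  override def transformSchema(schema: StructType): StructType = schema
}
{code}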

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17352) Executor computing time can be negative-number because of calculation error

2016-08-31 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-17352:
--

 Summary: Executor computing time can be negative-number because of 
calculation error
 Key: SPARK-17352
 URL: https://issues.apache.org/jira/browse/SPARK-17352
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.0.0
Reporter: Kousuke Saruta
Priority: Minor


In StagePage, the executor computing time is calculated by subtracting 
floating-point numbers, so rounding error can potentially make the result 
negative.
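
A tiny, self-contained illustration of how this can happen (the numbers are 
made up and this is not the actual StagePage code): subtracting 
double-precision components from a total can leave a slightly negative 
remainder, so the derived time needs to be clamped at zero.

{code}
object NegativeDurationSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical task timing components, in milliseconds.
    val totalDuration    = 0.3
    val deserializeTime  = 0.1
    val shuffleWriteTime = 0.2

    // 0.3 - 0.1 - 0.2 in double arithmetic is about -2.8e-17, not 0.
    val computingTime = totalDuration - deserializeTime - shuffleWriteTime
    println(computingTime)               // slightly negative

    // One common guard: clamp the derived value at zero before displaying it.
    println(math.max(computingTime, 0.0))
  }
}
{code}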



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17211) Broadcast join produces incorrect results on EMR with large driver memory

2016-08-31 Thread Himanish Kushary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453936#comment-15453936
 ] 

Himanish Kushary commented on SPARK-17211:
--

[~gurmukhd] You are seeing the errors outside of the EMR environment as well, 
right? On Databricks I ran it through a Scala notebook; I'm not sure what the 
underlying configuration was.

One more thing: while running the real job (with lots of data), I noticed that 
even with low driver memory settings and broadcasting enabled, some joins work 
fine while others corrupt the data (fields are either assigned null or their 
values get mixed up). What else would be using up 12 GB of memory on the node?

> Broadcast join produces incorrect results on EMR with large driver memory
> -
>
> Key: SPARK-17211
> URL: https://issues.apache.org/jira/browse/SPARK-17211
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Jarno Seppanen
>
> Broadcast join produces incorrect columns in join result, see below for an 
> example. The same join but without using broadcast gives the correct columns.
> Running PySpark on YARN on Amazon EMR 5.0.0.
> {noformat}
> import pyspark.sql.functions as func
> keys = [
> (5400, 0),
> (5401, 1),
> (5402, 2),
> ]
> keys_df = spark.createDataFrame(keys, ['key_id', 'value']).coalesce(1)
> keys_df.show()
> # ++-+
> # |  key_id|value|
> # ++-+
> # |5400|0|
> # |5401|1|
> # |5402|2|
> # ++-+
> data = [
> (5402,1),
> (5400,2),
> (5401,3),
> ]
> data_df = spark.createDataFrame(data, ['key_id', 'foo'])
> data_df.show()
> # ++---+  
> 
> # |  key_id|foo|
> # ++---+
> # |5402|  1|
> # |5400|  2|
> # |5401|  3|
> # ++---+
> ### INCORRECT ###
> data_df.join(func.broadcast(keys_df), 'key_id').show()
> # ++---++ 
> 
> # |  key_id|foo|   value|
> # ++---++
> # |5402|  1|5402|
> # |5400|  2|5400|
> # |5401|  3|5401|
> # ++---++
> ### CORRECT ###
> data_df.join(keys_df, 'key_id').show()
> # ++---+-+
> # |  key_id|foo|value|
> # ++---+-+
> # |5400|  2|0|
> # |5401|  3|1|
> # |5402|  1|2|
> # ++---+-+
> {noformat}
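
For anyone checking this from spark-shell rather than PySpark, a rough Scala 
equivalent of the reproduction above is sketched below (it assumes 
spark.implicits._ is in scope, as it is in the shell; the data simply mirrors 
the PySpark example).

{code}
import org.apache.spark.sql.functions.broadcast

val keysDF = Seq((5400, 0), (5401, 1), (5402, 2)).toDF("key_id", "value").coalesce(1)
val dataDF = Seq((5402, 1), (5400, 2), (5401, 3)).toDF("key_id", "foo")

// Suspect path: broadcast hint on the smaller side.
dataDF.join(broadcast(keysDF), "key_id").show()

// Reference result: the same join without the broadcast hint.
dataDF.join(keysDF, "key_id").show()
{code}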



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17287) PySpark sc.AddFile method does not support the recursive keyword argument

2016-08-31 Thread Jason Piper (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453933#comment-15453933
 ] 

Jason Piper commented on SPARK-17287:
-

Is anyone available to review this small change?

> PySpark sc.AddFile method does not support the recursive keyword argument
> -
>
> Key: SPARK-17287
> URL: https://issues.apache.org/jira/browse/SPARK-17287
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Jason Piper
>Priority: Minor
>
> The Scala Spark API implements a "recursive" keyword argument when using 
> sc.addFile that allows for an entire directory to be added, however, the 
> corresponding interface hasn't been added to PySpark.
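
For reference, the Scala overload being referred to looks like this (the path 
is a placeholder):

{code}
// SparkContext.addFile(path, recursive): the recursive flag distributes a whole
// directory to every node; PySpark currently exposes only addFile(path).
sc.addFile("hdfs:///data/config-dir", recursive = true)
{code}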



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets

2016-08-31 Thread Sean McKibben (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453884#comment-15453884
 ] 

Sean McKibben edited comment on SPARK-17147 at 9/1/16 1:08 AM:
---

I think Kafka's log compaction is still designed for sequential reading, even 
if the offsets of a compacted topic are not consecutive. Kafka's internal log 
cleaner copies log segments into new, compacted files, so the messages are 
still stored sequentially even when the offset metadata increases by more than 
one. The typical consumer just calls poll() to get the next records, regardless 
of their offsets, but Spark's CachedKafkaConsumer checks the offset of each 
record before calling poll(); if that offset isn't the previous record's offset 
+1, it calls consumer.seek() before the next poll(), which I think is what 
produces the dramatic slowdown I've seen.
It is certainly possible, using a non-Spark Kafka consumer, to get equivalent 
read speeds regardless of whether a topic is compacted.
I think the interplay between the CachedKafkaConsumer and the KafkaRDD might 
need to be adjusted. I haven't looked to see if more than one KafkaRDD will 
ever be asking for records from a single CachedKafkaConsumer instance, but 
since CachedKafkaConsumer was inspecting each offset to see if it was exactly 
the offset requested, and not just >= the requested offset, I'm guessing there 
was a reason. 

The main issue here is that it's becoming apparent that Kafka consumers can't 
assume consecutively increasing offsets. Unfortunately that is an assumption 
that Spark-Kafka was making, and I think that assumption will need to be 
removed.
(Edit: changed "monotonically" to "consecutively" above, since consumers _can_ 
assume an ever increasing set of offsets, just not consecutively increasing) 


was (Author: graphex):
I think Kafka's log compaction is still designed for sequential reading, even 
if the offsets of a compacted topic are not consecutive. Kafka's internal log 
cleaner copies log segments into new, compacted files, so the messages are 
still stored sequentially even when the offset metadata increases by more than 
one. The typical consumer just calls poll() to get the next records, regardless 
of their offsets, but Spark's CachedKafkaConsumer checks the offset of each 
record before calling poll(); if that offset isn't the previous record's offset 
+1, it calls consumer.seek() before the next poll(), which I think is what 
produces the dramatic slowdown I've seen.
It is certainly possible, using a non-Spark Kafka consumer, to get equivalent 
read speeds regardless of whether a topic is compacted.
I think the interplay between the CachedKafkaConsumer and the KafkaRDD might 
need to be adjusted. I haven't looked to see if more than one KafkaRDD will 
ever be asking for records from a single CachedKafkaConsumer instance, but 
since CachedKafkaConsumer was inspecting each offset to see if it was exactly 
the offset requested, and not just >= the requested offset, I'm guessing there 
was a reason. 

The main issue here is that it's becoming apparent that Kafka consumers can't 
assume monotonically incrementing offsets. Unfortunately that is an assumption 
that Spark-Kafka was making, and I think that assumption will need to be 
removed.

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets
> 
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset 
> will always be just an increment of 1 above the previous offset. 
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.
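
A hedged sketch of the workaround described above (the method and variable 
names are illustrative, not the actual CachedKafkaConsumer fields): track the 
next expected offset from the record that actually came back, so gaps left by 
compaction don't force an assertion failure or an extra seek(). Timeout and 
error handling are omitted for brevity.

{code}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

// Drain records up to (but not including) untilOffset, tolerating offset gaps.
def drainUpTo[K, V](consumer: KafkaConsumer[K, V], untilOffset: Long): Unit = {
  var nextOffset = 0L
  while (nextOffset < untilOffset) {
    val records = consumer.poll(512L).asScala   // poll(timeoutMs) in the 0.10 client
    for (record <- records if nextOffset < untilOffset) {
      // Old assumption: record.offset == nextOffset.
      // With log compaction the returned offset may jump ahead, so advance
      // from the record we actually received instead of adding 1 blindly.
      nextOffset = record.offset + 1
    }
  }
}
{code}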



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Updated] (SPARK-17323) ALTER VIEW AS should keep the previous table properties, comment, create_time, etc.

2016-08-31 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17323:

Fix Version/s: 2.0.1

> ALTER VIEW AS should keep the previous table properties, comment, 
> create_time, etc.
> ---
>
> Key: SPARK-17323
> URL: https://issues.apache.org/jira/browse/SPARK-17323
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.1, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17309) ALTER VIEW should throw exception if view not exist

2016-08-31 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17309:

Fix Version/s: 2.0.1

> ALTER VIEW should throw exception if view not exist
> ---
>
> Key: SPARK-17309
> URL: https://issues.apache.org/jira/browse/SPARK-17309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0, 2.0.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17180) Unable to Alter the Temporary View Using ALTER VIEW command

2016-08-31 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17180:

Fix Version/s: 2.0.1

> Unable to Alter the Temporary View Using ALTER VIEW command
> ---
>
> Key: SPARK-17180
> URL: https://issues.apache.org/jira/browse/SPARK-17180
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Wenchen Fan
> Fix For: 2.0.1, 2.1.0
>
>
> In the current master branch, when users do not specify the database name in 
> the `ALTER VIEW AS SELECT` command, we always try to alter the permanent view 
> even if the temporary view exists. 
> The expected behavior of `ALTER VIEW AS SELECT` should be: alter the 
> temporary view if it exists; otherwise, try to alter the permanent view. 
> This order is consistent with the `DROP VIEW` command, since users are 
> unable to specify the TEMPORARY keyword.
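
A hedged illustration of that resolution order (the view name is a 
placeholder; assumes a SparkSession named spark, e.g. in spark-shell). After 
the fix, the ALTER VIEW statement should hit the temporary view when one with 
that name exists:

{code}
spark.sql("CREATE TEMPORARY VIEW v AS SELECT 1 AS id")
spark.sql("ALTER VIEW v AS SELECT 2 AS id")   // expected to alter the temp view first
spark.sql("SELECT id FROM v").show()          // expected output: 2
{code}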



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17180) Unable to Alter the Temporary View Using ALTER VIEW command

2016-08-31 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17180:

Assignee: Wenchen Fan

> Unable to Alter the Temporary View Using ALTER VIEW command
> ---
>
> Key: SPARK-17180
> URL: https://issues.apache.org/jira/browse/SPARK-17180
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>
> In the current master branch, when users do not specify the database name in 
> the `ALTER VIEW AS SELECT` command, we always try to alter the permanent view 
> even if the temporary view exists. 
> The expected behavior of `ALTER VIEW AS SELECT` should be: alter the 
> temporary view if it exists; otherwise, try to alter the permanent view. 
> This order is consistent with the `DROP VIEW` command, since users are 
> unable to specify the TEMPORARY keyword.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17351) Refactor JDBCRDD to expose JDBC -> SparkSQL conversion functionality

2016-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17351:


Assignee: Josh Rosen  (was: Apache Spark)

> Refactor JDBCRDD to expose JDBC -> SparkSQL conversion functionality
> 
>
> Key: SPARK-17351
> URL: https://issues.apache.org/jira/browse/SPARK-17351
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> It would be useful if more of JDBCRDD's JDBC -> Spark SQL functionality was 
> usable from outside of JDBCRDD; this would make it easier to write test 
> harnesses comparing Spark output against other JDBC databases. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17351) Refactor JDBCRDD to expose JDBC -> SparkSQL conversion functionality

2016-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453896#comment-15453896
 ] 

Apache Spark commented on SPARK-17351:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14907

> Refactor JDBCRDD to expose JDBC -> SparkSQL conversion functionality
> 
>
> Key: SPARK-17351
> URL: https://issues.apache.org/jira/browse/SPARK-17351
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> It would be useful if more of JDBCRDD's JDBC -> Spark SQL functionality was 
> usable from outside of JDBCRDD; this would make it easier to write test 
> harnesses comparing Spark output against other JDBC databases. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17351) Refactor JDBCRDD to expose JDBC -> SparkSQL conversion functionality

2016-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17351:


Assignee: Apache Spark  (was: Josh Rosen)

> Refactor JDBCRDD to expose JDBC -> SparkSQL conversion functionality
> 
>
> Key: SPARK-17351
> URL: https://issues.apache.org/jira/browse/SPARK-17351
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> It would be useful if more of JDBCRDD's JDBC -> Spark SQL functionality was 
> usable from outside of JDBCRDD; this would make it easier to write test 
> harnesses comparing Spark output against other JDBC databases. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17351) Refactor JDBCRDD to expose JDBC -> SparkSQL conversion functionality

2016-08-31 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-17351:
--

 Summary: Refactor JDBCRDD to expose JDBC -> SparkSQL conversion 
functionality
 Key: SPARK-17351
 URL: https://issues.apache.org/jira/browse/SPARK-17351
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen


It would be useful if more of JDBCRDD's JDBC -> Spark SQL functionality was 
usable from outside of JDBCRDD; this would make it easier to write test 
harnesses comparing Spark output against other JDBC databases. 
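
A sketch of the motivating use case (connection details are placeholders): 
reading a table through Spark's JDBC source so its rows can be compared 
against the database's own output in a test harness.

{code}
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/testdb")
  .option("dbtable", "public.expected_results")
  .option("user", "test")
  .option("password", "test")
  .load()

// Compare against rows fetched directly through plain JDBC elsewhere in the test.
jdbcDF.show()
{code}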



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets

2016-08-31 Thread Sean McKibben (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453884#comment-15453884
 ] 

Sean McKibben commented on SPARK-17147:
---

I think Kafka's log compaction is still designed for sequential reading, even 
if the offsets of a compacted topic are not consecutive. Kafka's internal log 
cleaner copies log segments into new, compacted files, so the messages are 
still stored sequentially even when the offset metadata increases by more than 
one. The typical consumer just calls poll() to get the next records, regardless 
of their offsets, but Spark's CachedKafkaConsumer checks the offset of each 
record before calling poll(); if that offset isn't the previous record's offset 
+1, it calls consumer.seek() before the next poll(), which I think is what 
produces the dramatic slowdown I've seen.
It is certainly possible, using a non-Spark Kafka consumer, to get equivalent 
read speeds regardless of whether a topic is compacted.
I think the interplay between the CachedKafkaConsumer and the KafkaRDD might 
need to be adjusted. I haven't looked to see if more than one KafkaRDD will 
ever be asking for records from a single CachedKafkaConsumer instance, but 
since CachedKafkaConsumer was inspecting each offset to see if it was exactly 
the offset requested, and not just >= the requested offset, I'm guessing there 
was a reason. 

The main issue here is that it's becoming apparent that Kafka consumers can't 
assume monotonically incrementing offsets. Unfortunately that is an assumption 
that Spark-Kafka was making, and I think that assumption will need to be 
removed.

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets
> 
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset 
> will always be just an increment of 1 above the previous offset. 
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17350) Disable default use of KryoSerializer in Thrift Server

2016-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17350:


Assignee: Apache Spark  (was: Josh Rosen)

> Disable default use of KryoSerializer in Thrift Server
> --
>
> Key: SPARK-17350
> URL: https://issues.apache.org/jira/browse/SPARK-17350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> In SPARK-4761 (December 2014) we enabled Kryo serialization by default in the 
> Spark Thrift Server. However, I don't think that the original rationale for 
> doing this still holds as all Spark SQL serialization should now be performed 
> via efficient encoders and our UnsafeRow format. In addition, the use of Kryo 
> as the default serializer can introduce performance problems because the 
> creation of new KryoSerializer instances is expensive and we haven't 
> performed instance-reuse optimizations in several code paths (including 
> DirectTaskResult deserialization). Given all of this, I propose to revert 
> back to using JavaSerializer as the default serializer in the Thrift Server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17350) Disable default use of KryoSerializer in Thrift Server

2016-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453826#comment-15453826
 ] 

Apache Spark commented on SPARK-17350:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14906

> Disable default use of KryoSerializer in Thrift Server
> --
>
> Key: SPARK-17350
> URL: https://issues.apache.org/jira/browse/SPARK-17350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In SPARK-4761 (December 2014) we enabled Kryo serialization by default in the 
> Spark Thrift Server. However, I don't think that the original rationale for 
> doing this still holds as all Spark SQL serialization should now be performed 
> via efficient encoders and our UnsafeRow format. In addition, the use of Kryo 
> as the default serializer can introduce performance problems because the 
> creation of new KryoSerializer instances is expensive and we haven't 
> performed instance-reuse optimizations in several code paths (including 
> DirectTaskResult deserialization). Given all of this, I propose to revert 
> back to using JavaSerializer as the default serializer in the Thrift Server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17350) Disable default use of KryoSerializer in Thrift Server

2016-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17350:


Assignee: Josh Rosen  (was: Apache Spark)

> Disable default use of KryoSerializer in Thrift Server
> --
>
> Key: SPARK-17350
> URL: https://issues.apache.org/jira/browse/SPARK-17350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In SPARK-4761 (December 2014) we enabled Kryo serialization by default in the 
> Spark Thrift Server. However, I don't think that the original rationale for 
> doing this still holds as all Spark SQL serialization should now be performed 
> via efficient encoders and our UnsafeRow format. In addition, the use of Kryo 
> as the default serializer can introduce performance problems because the 
> creation of new KryoSerializer instances is expensive and we haven't 
> performed instance-reuse optimizations in several code paths (including 
> DirectTaskResult deserialization). Given all of this, I propose to revert 
> back to using JavaSerializer as the default serializer in the Thrift Server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17350) Disable default use of KryoSerializer in Thrift Server

2016-08-31 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-17350:
--

 Summary: Disable default use of KryoSerializer in Thrift Server
 Key: SPARK-17350
 URL: https://issues.apache.org/jira/browse/SPARK-17350
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen


In SPARK-4761 (December 2014) we enabled Kryo serialization by default in the 
Spark Thrift Server. However, I don't think that the original rationale for 
doing this still holds as all Spark SQL serialization should now be performed 
via efficient encoders and our UnsafeRow format. In addition, the use of Kryo 
as the default serializer can introduce performance problems because the 
creation of new KryoSerializer instances is expensive and we haven't performed 
instance-reuse optimizations in several code paths (including DirectTaskResult 
deserialization). Given all of this, I propose to revert back to using 
JavaSerializer as the default serializer in the Thrift Server.
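
If the default does revert to JavaSerializer, users who still want Kryo for 
the Thrift Server could opt back in explicitly; a minimal sketch (equivalent 
to passing --conf spark.serializer=... at startup):

{code}
import org.apache.spark.SparkConf

// Explicitly request Kryo rather than relying on the default.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
{code}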



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17341) Can't read Parquet data with fields containing periods "."

2016-08-31 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453806#comment-15453806
 ] 

Don Drake commented on SPARK-17341:
---

I just downloaded the nightly build from 8/31/2016 and gave it a try.

And it worked:

{code}
scala> inSquare.take(2)
res2: Array[org.apache.spark.sql.Row] = Array([1,1], [2,4])

scala> inSquare.show(false)
+-+-+
|value|squared.value|
+-+-+
|1|1|
|2|4|
|3|9|
|4|16   |
|5|25   |
+-+-+
{code}

Thanks.
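
As a side note for anyone hitting this on 2.0.0 before picking up the fix: 
when referring to a column whose name contains a dot, backtick-quoting the 
name avoids it being parsed as a struct field access. A small sketch, assuming 
spark-shell with spark.implicits._ in scope:

{code}
val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i))
  .toDF("value", "squared.value")

// Without backticks, "squared.value" would be resolved as field `value`
// inside a struct column `squared`; backticks make it a literal column name.
squaresDF.select($"`squared.value`").show()
{code}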

> Can't read Parquet data with fields containing periods "."
> --
>
> Key: SPARK-17341
> URL: https://issues.apache.org/jira/browse/SPARK-17341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Don Drake
>
> I am porting a set of Spark 1.6.2 applications to Spark 2.0 and I have 
> encountered a showstopper problem with Parquet dataset that have fields 
> containing a "." in a field name.  This data comes from an external provider 
> (CSV) and we just pass through the field names.  This has worked flawlessly 
> in Spark 1.5 and 1.6, but now spark can't seem to read these parquet files.  
> {code}
> Spark context available as 'sc' (master = local[*], app id = 
> local-1472664486578).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.0.0
>   /_/
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * 
> i)).toDF("value", "squared.value")
> 16/08/31 12:28:44 WARN ObjectStore: Version information not found in 
> metastore. hive.metastore.schema.verification is not enabled so recording the 
> schema version 1.2.0
> 16/08/31 12:28:44 WARN ObjectStore: Failed to get database default, returning 
> NoSuchObjectException
> squaresDF: org.apache.spark.sql.DataFrame = [value: int, squared.value: int]
> scala> squaresDF.take(2)
> res0: Array[org.apache.spark.sql.Row] = Array([1,1], [2,4])
> scala> squaresDF.write.parquet("squares")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> 

[jira] [Commented] (SPARK-17195) Dealing with JDBC column nullability when it is not reliable

2016-08-31 Thread Jason Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453801#comment-15453801
 ] 

Jason Moore commented on SPARK-17195:
-

That's right, and I totally agree that's where the fix needs to be, and I'm 
pressing them to make it. I guess that means this ticket can be closed, since 
it seems a reasonable workaround within Spark itself isn't possible. Once the 
Teradata (TD) driver has been fixed I'll return here to mention the version it 
is fixed in.

> Dealing with JDBC column nullability when it is not reliable
> 
>
> Key: SPARK-17195
> URL: https://issues.apache.org/jira/browse/SPARK-17195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jason Moore
>
> Starting with Spark 2.0.0, the column "nullable" property is important to 
> have correct for the code generation to work properly.  Marking the column as 
> nullable = false used to (<2.0.0) allow null values to be operated on, but 
> now this will result in:
> {noformat}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> {noformat}
> I'm all for the change towards more rigid behavior (enforcing correct 
> input). But the problem I'm facing now is that when I use JDBC to read from 
> a Teradata server, the column nullability is often not correct (particularly 
> when sub-queries are involved).
> This is the line in question:
> https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L140
> I'm trying to work out what would be the way forward for me on this.  I know 
> that it's really the fault of the Teradata database server not returning the 
> correct schema, but I'll need to make Spark itself or my application 
> resilient to this behavior.
> One of the Teradata JDBC Driver tech leads has told me that "when the 
> rsmd.getSchemaName and rsmd.getTableName methods return an empty zero-length 
> string, then the other metadata values may not be completely accurate" - so 
> one option could be to treat the nullability (at least) the same way as the 
> "unknown" case (as nullable = true).  For reference, see the rest of our 
> discussion here: 
> http://forums.teradata.com/forum/connectivity/teradata-jdbc-driver-returns-the-wrong-schema-column-nullability
> Any other thoughts?
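
One hedged, application-side workaround sketch (not a change to Spark, and not 
specific to Teradata): re-create the DataFrame with every column marked 
nullable, so unreliable driver metadata cannot feed nullable = false into the 
generated code. The helper name is illustrative.

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// Copy the schema with all fields relaxed to nullable = true.
def withAllColumnsNullable(df: DataFrame): DataFrame = {
  val relaxedSchema = StructType(df.schema.map(_.copy(nullable = true)))
  df.sparkSession.createDataFrame(df.rdd, relaxedSchema)
}

// Usage: val safeDF = withAllColumnsNullable(spark.read.jdbc(url, table, props))
{code}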



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17318) Fix flaky test: o.a.s.repl.ReplSuite replicating blocks of object with class defined in repl

2016-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17318:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Fix flaky test: o.a.s.repl.ReplSuite replicating blocks of object with class 
> defined in repl
> 
>
> Key: SPARK-17318
> URL: https://issues.apache.org/jira/browse/SPARK-17318
> Project: Spark
>  Issue Type: Test
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> There are a lot of failures recently: 
> http://spark-tests.appspot.com/tests/org.apache.spark.repl.ReplSuite/replicating%20blocks%20of%20object%20with%20class%20defined%20in%20repl



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17318) Fix flaky test: o.a.s.repl.ReplSuite replicating blocks of object with class defined in repl

2016-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453793#comment-15453793
 ] 

Apache Spark commented on SPARK-17318:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/14905

> Fix flaky test: o.a.s.repl.ReplSuite replicating blocks of object with class 
> defined in repl
> 
>
> Key: SPARK-17318
> URL: https://issues.apache.org/jira/browse/SPARK-17318
> Project: Spark
>  Issue Type: Test
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> There are a lot of failures recently: 
> http://spark-tests.appspot.com/tests/org.apache.spark.repl.ReplSuite/replicating%20blocks%20of%20object%20with%20class%20defined%20in%20repl



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17318) Fix flaky test: o.a.s.repl.ReplSuite replicating blocks of object with class defined in repl

2016-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17318:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Fix flaky test: o.a.s.repl.ReplSuite replicating blocks of object with class 
> defined in repl
> 
>
> Key: SPARK-17318
> URL: https://issues.apache.org/jira/browse/SPARK-17318
> Project: Spark
>  Issue Type: Test
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> There are a lot of failures recently: 
> http://spark-tests.appspot.com/tests/org.apache.spark.repl.ReplSuite/replicating%20blocks%20of%20object%20with%20class%20defined%20in%20repl



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17349) Update testthat package on Jenkins

2016-08-31 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453784#comment-15453784
 ] 

Hyukjin Kwon commented on SPARK-17349:
--

Cool!

> Update testthat package on Jenkins
> --
>
> Key: SPARK-17349
> URL: https://issues.apache.org/jira/browse/SPARK-17349
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SparkR
>Reporter: Shivaram Venkataraman
>Assignee: shane knapp
>Priority: Minor
>
> As per https://github.com/apache/spark/pull/14889#issuecomment-243697097 
> using version 1.0 of testthat will improve the messages printed at the end of 
> a test to include skipped tests etc.
> The current package version on Jenkins is 0.11.0 and we can upgrade this to 
> 1.0.0 if there are no conflicts etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-08-31 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453789#comment-15453789
 ] 

Cody Koeninger commented on SPARK-15406:


There's a big difference between continuing to publish an existing
implementation that requires little to no maintenance, and putting new
development work into a dead version.

I don't think removing existing functionality for Kafka 0.8 users makes
sense. Nor do I think new development should be chained to old versions.

We already had this argument around the Kafka connector, and I believe it
was resolved in a sensible way - keep the old existing code around, but
stop putting effort into it.


> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17341) Can't read Parquet data with fields containing periods "."

2016-08-31 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453775#comment-15453775
 ] 

Hyukjin Kwon commented on SPARK-17341:
--

Ah, the issue itself does not seem to be a duplicate, but the fix should 
address this as well.

> Can't read Parquet data with fields containing periods "."
> --
>
> Key: SPARK-17341
> URL: https://issues.apache.org/jira/browse/SPARK-17341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Don Drake
>
> I am porting a set of Spark 1.6.2 applications to Spark 2.0 and I have 
> encountered a showstopper problem with Parquet dataset that have fields 
> containing a "." in a field name.  This data comes from an external provider 
> (CSV) and we just pass through the field names.  This has worked flawlessly 
> in Spark 1.5 and 1.6, but now spark can't seem to read these parquet files.  
> {code}
> Spark context available as 'sc' (master = local[*], app id = 
> local-1472664486578).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.0.0
>   /_/
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * 
> i)).toDF("value", "squared.value")
> 16/08/31 12:28:44 WARN ObjectStore: Version information not found in 
> metastore. hive.metastore.schema.verification is not enabled so recording the 
> schema version 1.2.0
> 16/08/31 12:28:44 WARN ObjectStore: Failed to get database default, returning 
> NoSuchObjectException
> squaresDF: org.apache.spark.sql.DataFrame = [value: int, squared.value: int]
> scala> squaresDF.take(2)
> res0: Array[org.apache.spark.sql.Row] = Array([1,1], [2,4])
> scala> squaresDF.write.parquet("squares")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet dictionary page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet dictionary page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> 

[jira] [Commented] (SPARK-17341) Can't read Parquet data with fields containing periods "."

2016-08-31 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453767#comment-15453767
 ] 

Hyukjin Kwon commented on SPARK-17341:
--

Is this a duplicate of SPARK-16698? I believe this does not happen in current 
master. Could you confirm it please?

> Can't read Parquet data with fields containing periods "."
> --
>
> Key: SPARK-17341
> URL: https://issues.apache.org/jira/browse/SPARK-17341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Don Drake
>
> I am porting a set of Spark 1.6.2 applications to Spark 2.0 and I have 
> encountered a showstopper problem with Parquet dataset that have fields 
> containing a "." in a field name.  This data comes from an external provider 
> (CSV) and we just pass through the field names.  This has worked flawlessly 
> in Spark 1.5 and 1.6, but now spark can't seem to read these parquet files.  
> {code}
> Spark context available as 'sc' (master = local[*], app id = 
> local-1472664486578).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.0.0
>   /_/
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * 
> i)).toDF("value", "squared.value")
> 16/08/31 12:28:44 WARN ObjectStore: Version information not found in 
> metastore. hive.metastore.schema.verification is not enabled so recording the 
> schema version 1.2.0
> 16/08/31 12:28:44 WARN ObjectStore: Failed to get database default, returning 
> NoSuchObjectException
> squaresDF: org.apache.spark.sql.DataFrame = [value: int, squared.value: int]
> scala> squaresDF.take(2)
> res0: Array[org.apache.spark.sql.Row] = Array([1,1], [2,4])
> scala> squaresDF.write.parquet("squares")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet dictionary page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet dictionary page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: 

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-08-31 Thread Frederick Reiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453749#comment-15453749
 ] 

Frederick Reiss commented on SPARK-15406:
-

WRT Kafka 0.8: I'm under the impression that there is a significant number of 
Spark users who are still stuck with Kafka 0.8 or 0.9. Kafka sits between 
multiple systems, so upgrades to production Kafka installations can be 
challenging. Are other people monitoring this JIRA seeing a similar situation, 
or are versions before 0.10 not in widespread use any more? If older Kafka 
releases aren't relevant, then we should probably deprecate the entire 
spark-streaming-kafka-0-8 component.

> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17342) Style of event timeline is broken

2016-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453694#comment-15453694
 ] 

Apache Spark commented on SPARK-17342:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/14900

> Style of event timeline is broken
> -
>
> Key: SPARK-17342
> URL: https://issues.apache.org/jira/browse/SPARK-17342
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>
> SPARK-15373 updated the version of vis.js to 4.16.1. As of 4.0.0, some 
> classes were renamed (e.g. 'timeline' to 'vis-timeline'), but that ticket did 
> not account for the renames, so the style is now broken.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17342) Style of event timeline is broken

2016-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17342:


Assignee: (was: Apache Spark)

> Style of event timeline is broken
> -
>
> Key: SPARK-17342
> URL: https://issues.apache.org/jira/browse/SPARK-17342
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>
> SPARK-15373 updated the version of vis.js to 4.16.1. As of 4.0.0, some 
> classes were renamed (e.g. 'timeline' to 'vis-timeline'), but that ticket did 
> not account for the renames, so the style is now broken.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17342) Style of event timeline is broken

2016-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17342:


Assignee: Apache Spark

> Style of event timeline is broken
> -
>
> Key: SPARK-17342
> URL: https://issues.apache.org/jira/browse/SPARK-17342
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>
> SPARK-15373 updated the version of vis.js to 4.16.1. As of 4.0.0, some 
> classes were renamed (e.g. 'timeline' to 'vis-timeline'), but that ticket did 
> not account for the renames, so the style is now broken.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16581) Making JVM backend calling functions public

2016-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453592#comment-15453592
 ] 

Apache Spark commented on SPARK-16581:
--

User 'shivaram' has created a pull request for this issue:
https://github.com/apache/spark/pull/14904

> Making JVM backend calling functions public
> ---
>
> Key: SPARK-16581
> URL: https://issues.apache.org/jira/browse/SPARK-16581
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
> Fix For: 2.0.1, 2.1.0
>
>
> As described in the design doc in SPARK-15799, to help packages that need to 
> call into the JVM, it would be good to expose some of the R -> JVM functions 
> we have. 
> As a part of this we could also rename and reformat the functions to make 
> them more user-friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17211) Broadcast join produces incorrect results on EMR with large driver memory

2016-08-31 Thread gurmukh singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453541#comment-15453541
 ] 

gurmukh singh edited comment on SPARK-17211 at 8/31/16 10:24 PM:
-

Hi

I can see this in Apache Spark 2.0 as well, running with the same node 
configurations as mentioned above.

Apache Hadoop 2.7.2, Spark 2.0:
---

[hadoop@sp1 ~]$ hadoop version
Hadoop 2.7.2
Subversion Unknown -r Unknown
Compiled by root on 2016-05-16T03:56Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using 
/opt/cluster/hadoop-2.7.2/share/hadoop/common/hadoop-common-2.7.2.jar

[hadoop@sp1 ~]$ spark-shell --version
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
  /_/

Branch
Compiled by user jenkins on 2016-07-19T21:16:09Z
Revision

[hadoop@sp1 hadoop]$ spark-shell --master yarn --deploy-mode client 
--num-executors 20 --executor-cores 2 --executor-memory 12g --driver-memory 48g 
--conf spark.yarn.executor.memoryOverhead=4096 --conf 
spark.sql.shuffle.partitions=1024 --conf spark.yarn.maxAppAttempts=1
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/08/31 04:29:48 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
16/08/31 04:29:49 WARN yarn.Client: Neither spark.yarn.jars nor 
spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/08/31 04:30:19 WARN spark.SparkContext: Use an existing SparkContext, some 
configuration may not take effect.
Spark context Web UI available at http://10.0.0.227:4040
Spark context available as 'sc' (master = yarn, app id = 
application_1472617754154_0001).
Spark session available as 'spark'.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
  /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val a1 = Array((123,1),(234,2),(432,5))
a1: Array[(Int, Int)] = Array((123,1), (234,2), (432,5))

scala> val a2 = Array(("abc",1),("bcd",2),("dcb",5))
a2: Array[(String, Int)] = Array((abc,1), (bcd,2), (dcb,5))

scala> val df1 = sc.parallelize(a1).toDF("gid","id")
df1: org.apache.spark.sql.DataFrame = [gid: int, id: int]

scala> val df2 = sc.parallelize(a2).toDF("gname","id")
df2: org.apache.spark.sql.DataFrame = [gname: string, id: int]


scala> df1.join(df2,"id").show()
+---+---+-+
| id|gid|gname|
+---+---+-+
|  1|123|  abc|
|  2|234|  bcd|
|  5|432|  dcb|
+---+---+-+

scala> val df2 = sc.parallelize(a2).toDF("gname","id")
df2: org.apache.spark.sql.DataFrame = [gname: string, id: int]

scala> df1.join(broadcast(df2),"id").show()
+---+---+-+
| id|gid|gname|
+---+---+-+
|  1|123| null|
|  2|234| null|
|  5|432| null|
+---+---+-+

scala> broadcast(df1).join(df2,"id").show()
+---+---+-+
| id|gid|gname|
+---+---+-+
|  0|  1|  abc|
|  0|  2|  bcd|
|  0|  5|  dcb|
+---+---+-+

If I reduce the driver memory, this works as well. 

It works on Apache Spark 1.6.

As a lot of things have changed in Spark 2.0, this needs to be looked into. It 
should raise an error or OOM instead of silently returning NULL or zero values.

[~himanish] That said, it would be interesting to understand the use case: on a 
node with 61 GB, running with driver memory=48GB leaves just 12 GB for 
everything else, on top of the other overheads on the system.

On Databricks, are you running with the same parameters?
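A minimal workaround sketch, assuming the wrong results are limited to the 
broadcast-hash path (the plain shuffle join above returns correct data): 
disable automatic broadcasting and avoid the broadcast() hint until the bug is 
fixed.
{code}
// Sketch only: force the planner to use a shuffle/sort-merge join instead of a
// broadcast-hash join by disabling the automatic broadcast threshold.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)
df1.join(df2, "id").show()   // no broadcast() hint, so no broadcast exchange is planned
{code}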



was (Author: gurmukhd):
Hi

I can see this in Apache Spark 2.0 as well, running with same node 
configurations as mentioned above.

Apache Hadoop 2.72., Spark 2.0:
---

[hadoop@sp1 ~]$ hadoop version
Hadoop 2.7.2
Subversion Unknown -r Unknown
Compiled by root on 2016-05-16T03:56Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using 
/opt/cluster/hadoop-2.7.2/share/hadoop/common/hadoop-common-2.7.2.jar

[hadoop@sp1 ~]$ spark-shell --version
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
  /_/

Branch
Compiled by user jenkins on 2016-07-19T21:16:09Z
Revision

[hadoop@sp1 hadoop]$ spark-shell --master yarn --deploy-mode client 
--num-executors 20 --executor-cores 2 --executor-memory 12g --driver-memory 48g 
--conf spark.yarn.executor.memoryOverhead=4096 --conf 
spark.sql.shuffle.partitions=1024 --conf spark.yarn.maxAppAttempts=1
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/08/31 04:29:48 WARN 

[jira] [Comment Edited] (SPARK-17211) Broadcast join produces incorrect results on EMR with large driver memory

2016-08-31 Thread gurmukh singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453541#comment-15453541
 ] 

gurmukh singh edited comment on SPARK-17211 at 8/31/16 10:22 PM:
-

Hi

I can see this in Apache Spark 2.0 as well, running with the same node 
configurations as mentioned above.

Apache Hadoop 2.7.2, Spark 2.0:
---

[hadoop@sp1 ~]$ hadoop version
Hadoop 2.7.2
Subversion Unknown -r Unknown
Compiled by root on 2016-05-16T03:56Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using 
/opt/cluster/hadoop-2.7.2/share/hadoop/common/hadoop-common-2.7.2.jar

[hadoop@sp1 ~]$ spark-shell --version
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
  /_/

Branch
Compiled by user jenkins on 2016-07-19T21:16:09Z
Revision

[hadoop@sp1 hadoop]$ spark-shell --master yarn --deploy-mode client 
--num-executors 20 --executor-cores 2 --executor-memory 12g --driver-memory 48g 
--conf spark.yarn.executor.memoryOverhead=4096 --conf 
spark.sql.shuffle.partitions=1024 --conf spark.yarn.maxAppAttempts=1
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/08/31 04:29:48 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
16/08/31 04:29:49 WARN yarn.Client: Neither spark.yarn.jars nor 
spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/08/31 04:30:19 WARN spark.SparkContext: Use an existing SparkContext, some 
configuration may not take effect.
Spark context Web UI available at http://10.0.0.227:4040
Spark context available as 'sc' (master = yarn, app id = 
application_1472617754154_0001).
Spark session available as 'spark'.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
  /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val a1 = Array((123,1),(234,2),(432,5))
a1: Array[(Int, Int)] = Array((123,1), (234,2), (432,5))

scala> val a2 = Array(("abc",1),("bcd",2),("dcb",5))
a2: Array[(String, Int)] = Array((abc,1), (bcd,2), (dcb,5))

scala> val df1 = sc.parallelize(a1).toDF("gid","id")
df1: org.apache.spark.sql.DataFrame = [gid: int, id: int]

scala> val df2 = sc.parallelize(a2).toDF("gname","id")
df2: org.apache.spark.sql.DataFrame = [gname: string, id: int]


scala> df1.join(df2,"id").show()
+---+---+-+
| id|gid|gname|
+---+---+-+
|  1|123|  abc|
|  2|234|  bcd|
|  5|432|  dcb|
+---+---+-+

scala> val df2 = sc.parallelize(a2).toDF("gname","id")
df2: org.apache.spark.sql.DataFrame = [gname: string, id: int]

scala> df1.join(broadcast(df2),"id").show()
+---+---+-+
| id|gid|gname|
+---+---+-+
|  1|123| null|
|  2|234| null|
|  5|432| null|
+---+---+-+

scala> broadcast(df1).join(df2,"id").show()
+---+---+-+
| id|gid|gname|
+---+---+-+
|  0|  1|  abc|
|  0|  2|  bcd|
|  0|  5|  dcb|
+---+---+-+

If I reduce the driver memory, this works as well. 

It works on Apache Spark 1.6.

As a lot of things have changed in Spark 2.0, this needs to be looked into. It 
should raise an error or OOM instead of silently returning NULL or zero values.

[~himanish] That said, it would be interesting to understand the use case: on a 
node with 61 GB, running with driver memory=48GB leaves just 12 GB for 
everything else, on top of the other overheads on the system.




was (Author: gurmukhd):
Hi

I can see this in Apache Spark 2.0 as well, running with same node 
configurations as above mentioned.

Apache Hadoop 2.72., Spark 2.0:
---

[hadoop@sp1 ~]$ hadoop version
Hadoop 2.7.2
Subversion Unknown -r Unknown
Compiled by root on 2016-05-16T03:56Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using 
/opt/cluster/hadoop-2.7.2/share/hadoop/common/hadoop-common-2.7.2.jar

[hadoop@sp1 ~]$ spark-shell --version
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
  /_/

Branch
Compiled by user jenkins on 2016-07-19T21:16:09Z
Revision

[hadoop@sp1 hadoop]$ spark-shell --master yarn --deploy-mode client 
--num-executors 20 --executor-cores 2 --executor-memory 12g --driver-memory 48g 
--conf spark.yarn.executor.memoryOverhead=4096 --conf 
spark.sql.shuffle.partitions=1024 --conf spark.yarn.maxAppAttempts=1
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/08/31 04:29:48 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library 

[jira] [Comment Edited] (SPARK-17211) Broadcast join produces incorrect results on EMR with large driver memory

2016-08-31 Thread gurmukh singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453541#comment-15453541
 ] 

gurmukh singh edited comment on SPARK-17211 at 8/31/16 10:19 PM:
-

Hi

I can see this in Apache Spark 2.0 as well, running with the same node 
configurations as mentioned above.

Apache Hadoop 2.7.2, Spark 2.0:
---

[hadoop@sp1 ~]$ hadoop version
Hadoop 2.7.2
Subversion Unknown -r Unknown
Compiled by root on 2016-05-16T03:56Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using 
/opt/cluster/hadoop-2.7.2/share/hadoop/common/hadoop-common-2.7.2.jar

[hadoop@sp1 ~]$ spark-shell --version
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
  /_/

Branch
Compiled by user jenkins on 2016-07-19T21:16:09Z
Revision

[hadoop@sp1 hadoop]$ spark-shell --master yarn --deploy-mode client 
--num-executors 20 --executor-cores 2 --executor-memory 12g --driver-memory 48g 
--conf spark.yarn.executor.memoryOverhead=4096 --conf 
spark.sql.shuffle.partitions=1024 --conf spark.yarn.maxAppAttempts=1
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/08/31 04:29:48 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
16/08/31 04:29:49 WARN yarn.Client: Neither spark.yarn.jars nor 
spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/08/31 04:30:19 WARN spark.SparkContext: Use an existing SparkContext, some 
configuration may not take effect.
Spark context Web UI available at http://10.0.0.227:4040
Spark context available as 'sc' (master = yarn, app id = 
application_1472617754154_0001).
Spark session available as 'spark'.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
  /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val a1 = Array((123,1),(234,2),(432,5))
a1: Array[(Int, Int)] = Array((123,1), (234,2), (432,5))

scala> val a2 = Array(("abc",1),("bcd",2),("dcb",5))
a2: Array[(String, Int)] = Array((abc,1), (bcd,2), (dcb,5))

scala> val df1 = sc.parallelize(a1).toDF("gid","id")
df1: org.apache.spark.sql.DataFrame = [gid: int, id: int]

scala> val df2 = sc.parallelize(a2).toDF("gname","id")
df2: org.apache.spark.sql.DataFrame = [gname: string, id: int]


scala> df1.join(df2,"id").show()
+---+---+-+
| id|gid|gname|
+---+---+-+
|  1|123|  abc|
|  2|234|  bcd|
|  5|432|  dcb|
+---+---+-+

scala> val df2 = sc.parallelize(a2).toDF("gname","id")
df2: org.apache.spark.sql.DataFrame = [gname: string, id: int]

scala> df1.join(broadcast(df2),"id").show()
+---+---+-+
| id|gid|gname|
+---+---+-+
|  1|123| null|
|  2|234| null|
|  5|432| null|
+---+---+-+

scala> broadcast(df1).join(df2,"id").show()
+---+---+-+
| id|gid|gname|
+---+---+-+
|  0|  1|  abc|
|  0|  2|  bcd|
|  0|  5|  dcb|
+---+---+-+

If I reduce the driver memory, this works as well. 

It works on Apache Spark 1.6.

As a lot of things have changed in Spark 2.0, this needs to be looked into. It 
should raise an error or OOM instead of silently returning NULL or zero values.

[~himanish] That said, it would be interesting to understand the use case: on a 
node with 61 GB, running with driver memory=48GB leaves just 12 GB for 
everything else, on top of the other overheads on the system.




was (Author: gurmukhd):
Hi

I can see this in Apache Spark 2.0 as well running with same node 
configurations as above mentioned.

Apache Hadoop 2.72., Spark 2.0:
---

[hadoop@sp1 ~]$ hadoop version
Hadoop 2.7.2
Subversion Unknown -r Unknown
Compiled by root on 2016-05-16T03:56Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using 
/opt/cluster/hadoop-2.7.2/share/hadoop/common/hadoop-common-2.7.2.jar

[hadoop@sp1 ~]$ spark-shell --version
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
  /_/

Branch
Compiled by user jenkins on 2016-07-19T21:16:09Z
Revision

[hadoop@sp1 hadoop]$ spark-shell --master yarn --deploy-mode client 
--num-executors 20 --executor-cores 2 --executor-memory 12g --driver-memory 48g 
--conf spark.yarn.executor.memoryOverhead=4096 --conf 
spark.sql.shuffle.partitions=1024 --conf spark.yarn.maxAppAttempts=1
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/08/31 04:29:48 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for 

[jira] [Commented] (SPARK-17211) Broadcast join produces incorrect results on EMR with large driver memory

2016-08-31 Thread gurmukh singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453541#comment-15453541
 ] 

gurmukh singh commented on SPARK-17211:
---

Hi

I can see this in Apache Spark 2.0 as well, running with the same node 
configurations as mentioned above.

Apache Hadoop 2.7.2, Spark 2.0:
---

[hadoop@sp1 ~]$ hadoop version
Hadoop 2.7.2
Subversion Unknown -r Unknown
Compiled by root on 2016-05-16T03:56Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using 
/opt/cluster/hadoop-2.7.2/share/hadoop/common/hadoop-common-2.7.2.jar

[hadoop@sp1 ~]$ spark-shell --version
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
  /_/

Branch
Compiled by user jenkins on 2016-07-19T21:16:09Z
Revision

[hadoop@sp1 hadoop]$ spark-shell --master yarn --deploy-mode client 
--num-executors 20 --executor-cores 2 --executor-memory 12g --driver-memory 48g 
--conf spark.yarn.executor.memoryOverhead=4096 --conf 
spark.sql.shuffle.partitions=1024 --conf spark.yarn.maxAppAttempts=1
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/08/31 04:29:48 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
16/08/31 04:29:49 WARN yarn.Client: Neither spark.yarn.jars nor 
spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/08/31 04:30:19 WARN spark.SparkContext: Use an existing SparkContext, some 
configuration may not take effect.
Spark context Web UI available at http://10.0.0.227:4040
Spark context available as 'sc' (master = yarn, app id = 
application_1472617754154_0001).
Spark session available as 'spark'.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
  /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val a1 = Array((123,1),(234,2),(432,5))
a1: Array[(Int, Int)] = Array((123,1), (234,2), (432,5))

scala> val a2 = Array(("abc",1),("bcd",2),("dcb",5))
a2: Array[(String, Int)] = Array((abc,1), (bcd,2), (dcb,5))

scala> val df1 = sc.parallelize(a1).toDF("gid","id")
df1: org.apache.spark.sql.DataFrame = [gid: int, id: int]

scala> val df2 = sc.parallelize(a2).toDF("gname","id")
df2: org.apache.spark.sql.DataFrame = [gname: string, id: int]


scala> df1.join(df2,"id").show()
+---+---+-+
| id|gid|gname|
+---+---+-+
|  1|123|  abc|
|  2|234|  bcd|
|  5|432|  dcb|
+---+---+-+

scala> val df2 = sc.parallelize(a2).toDF("gname","id")
df2: org.apache.spark.sql.DataFrame = [gname: string, id: int]

scala> df1.join(broadcast(df2),"id").show()
+---+---+-+
| id|gid|gname|
+---+---+-+
|  1|123| null|
|  2|234| null|
|  5|432| null|
+---+---+-+

scala> broadcast(df1).join(df2,"id").show()
+---+---+-+
| id|gid|gname|
+---+---+-+
|  0|  1|  abc|
|  0|  2|  bcd|
|  0|  5|  dcb|
+---+---+-+

If I reduce the driver memory, this works as well. 

As a lot of things have changed in Spark 2.0, this needs to be looked into. It 
should raise an error or OOM instead of silently returning NULL or zero values.

[~himanish] That said, it would be interesting to understand the use case: on a 
node with 61 GB, running with driver memory=48GB leaves just 12 GB for 
everything else, on top of the other overheads on the system.

It works on Apache Spark 1.6.

> Broadcast join produces incorrect results on EMR with large driver memory
> -
>
> Key: SPARK-17211
> URL: https://issues.apache.org/jira/browse/SPARK-17211
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Jarno Seppanen
>
> Broadcast join produces incorrect columns in join result, see below for an 
> example. The same join but without using broadcast gives the correct columns.
> Running PySpark on YARN on Amazon EMR 5.0.0.
> {noformat}
> import pyspark.sql.functions as func
> keys = [
> (5400, 0),
> (5401, 1),
> (5402, 2),
> ]
> keys_df = spark.createDataFrame(keys, ['key_id', 'value']).coalesce(1)
> keys_df.show()
> # ++-+
> # |  key_id|value|
> # ++-+
> # |5400|0|
> # |5401|1|
> # |5402|2|
> # ++-+
> data = [
> (5402,1),
> (5400,2),
> (5401,3),
> ]
> data_df = spark.createDataFrame(data, ['key_id', 'foo'])
> data_df.show()
> # ++---+  
> 
> # |  key_id|foo|
> # 

[jira] [Commented] (SPARK-17348) Incorrect results from subquery transformation

2016-08-31 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453516#comment-15453516
 ] 

Herman van Hovell commented on SPARK-17348:
---

This is an interesting one. TBH I have never seen such a query being used in 
practice. Could you tell me what the analyzed plan should look like? The only 
solution I can see would be to implicitly join {{t1}} to {{t2}} in the 
subquery. 

For 2.0.1 we should fail analysis in this case. We have an analyzer rule in 
place to make sure no one uses aggregates in combination with correlated scalar 
subqueries. We could extend that rule or move it into analysis. It would be 
nice to have a fix for 2.1.
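For reference, a plain-Scala sketch (using the t1/t2 rows from the description 
below) that evaluates the intended subquery semantics by hand and arrives at 
the expected empty result:
{code}
val t1 = Seq((1, 1))           // (c1, c2)
val t2 = Seq((1, 1), (2, 0))   // (c1, c2)

// For each t1 row, restrict t2 with the correlated predicate t1.c2 >= t2.c2,
// take max(t2.c1), and keep the t1 row only if c1 is IN that single value.
val result = t1.filter { case (c1, c2) =>
  val maxC1 = t2.filter { case (_, t2c2) => c2 >= t2c2 }.map(_._1).max
  c1 == maxC1
}.map(_._1)
// result: List() -- both t2 rows qualify, so max(t2.c1) = 2 and 1 IN (2) is false
{code}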

> Incorrect results from subquery transformation
> --
>
> Key: SPARK-17348
> URL: https://issues.apache.org/jira/browse/SPARK-17348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>  Labels: correctness
>
> {noformat}
> Seq((1,1)).toDF("c1","c2").createOrReplaceTempView("t1")
> Seq((1,1),(2,0)).toDF("c1","c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where c1 in (select max(t2.c1) from t2 where t1.c2 >= 
> t2.c2)").show
> +---+
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result of the above query should be an empty set. Here is an 
> explanation:
> Both rows from T2 satisfy the correlated predicate T1.C2 >= T2.C2 when 
> T1.C1 = 1, so both rows need to be processed in the same group of the 
> aggregation in the subquery. The aggregation yields MAX(T2.C1) = 2. 
> Therefore, the predicate T1.C1 (which is 1) IN MAX(T2.C1) (which is 2) is 
> false, and the correct result is an empty set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17348) Incorrect results from subquery transformation

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17348:
---
Labels: correctness  (was: )

> Incorrect results from subquery transformation
> --
>
> Key: SPARK-17348
> URL: https://issues.apache.org/jira/browse/SPARK-17348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>  Labels: correctness
>
> {noformat}
> Seq((1,1)).toDF("c1","c2").createOrReplaceTempView("t1")
> Seq((1,1),(2,0)).toDF("c1","c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where c1 in (select max(t2.c1) from t2 where t1.c2 >= 
> t2.c2)").show
> +---+
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result of the above query should be an empty set. Here is an 
> explanation:
> Both rows from T2 satisfy the correlated predicate T1.C2 >= T2.C2 when 
> T1.C1 = 1, so both rows need to be processed in the same group of the 
> aggregation in the subquery. The aggregation yields MAX(T2.C1) = 2. 
> Therefore, the predicate T1.C1 (which is 1) IN MAX(T2.C1) (which is 2) is 
> false, and the correct result is an empty set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17316) Don't block StandaloneSchedulerBackend.executorRemoved

2016-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453471#comment-15453471
 ] 

Apache Spark commented on SPARK-17316:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/14902

> Don't block StandaloneSchedulerBackend.executorRemoved
> --
>
> Key: SPARK-17316
> URL: https://issues.apache.org/jira/browse/SPARK-17316
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.1, 2.1.0
>
>
> StandaloneSchedulerBackend.executorRemoved is a blocking call right now. It 
> may cause a deadlock since it's called inside 
> StandaloneAppClient.ClientEndpoint.
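A minimal sketch of the general fix pattern, assuming the change simply moves 
the blocking cleanup off the endpoint's message loop onto a dedicated thread 
(the names below are illustrative, not Spark's actual identifiers, and the real 
PR may differ):
{code}
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object NonBlockingRemovalSketch {
  private val cleanupPool = Executors.newSingleThreadExecutor()
  private implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(cleanupPool)

  // Stand-in for the slow, blocking cleanup work.
  private def removeExecutorBlocking(id: String, reason: String): Unit = {
    Thread.sleep(100)
    println(s"removed executor $id: $reason")
  }

  // Called from the RPC endpoint: returns immediately, so the endpoint's
  // message loop is never held up and cannot deadlock on its own messages.
  def executorRemoved(id: String, reason: String): Unit =
    Future(removeExecutorBlocking(id, reason))
}
{code}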



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16818) Exchange reuse incorrectly reuses scans over different sets of partitions

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-16818:
---
Labels: correctness  (was: )

> Exchange reuse incorrectly reuses scans over different sets of partitions
> -
>
> Key: SPARK-16818
> URL: https://issues.apache.org/jira/browse/SPARK-16818
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Critical
>  Labels: correctness
> Fix For: 2.0.1, 2.1.0
>
>
> This happens because the file scan operator does not take into account 
> partition pruning in its implementation of `sameResult()`. As a result, 
> executions may be incorrect on self-joins over the same base file relation. 
> Here's a minimal test case to reproduce:
> {code}
> spark.conf.set("spark.sql.exchange.reuse", true)  // defaults to true in 
> 2.0
> withTempPath { path =>
>   val tempDir = path.getCanonicalPath
>   spark.range(10)
> .selectExpr("id % 2 as a", "id % 3 as b", "id as c")
> .write
> .partitionBy("a")
> .parquet(tempDir)
>   val df = spark.read.parquet(tempDir)
>   val df1 = df.where("a = 0").groupBy("b").agg("c" -> "sum")
>   val df2 = df.where("a = 1").groupBy("b").agg("c" -> "sum")
>   checkAnswer(df1.join(df2, "b"), Row(0, 6, 12) :: Row(1, 4, 8) :: Row(2, 
> 10, 5) :: Nil)
> {code}
> When exchange reuse is on, the result is
> {code}
> +---+--+--+
> |  b|sum(c)|sum(c)|
> +---+--+--+
> |  0| 6| 6|
> |  1| 4| 4|
> |  2|10|10|
> +---+--+--+
> {code}
> The correct result is
> {code}
> +---+--+--+
> |  b|sum(c)|sum(c)|
> +---+--+--+
> |  0| 6|12|
> |  1| 4| 8|
> |  2|10| 5|
> +---+--+--+
> {code}
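A possible mitigation sketch while the fix is pending, assuming the incorrect 
sharing is limited to this feature: turn exchange reuse off (the same flag 
shown in the reproduction above), at the cost of re-scanning the base relation.
{code}
// Sketch only: plan each pruned scan separately instead of reusing one exchange.
spark.conf.set("spark.sql.exchange.reuse", false)
{code}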



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2016-08-31 Thread Matthew Seal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453034#comment-15453034
 ] 

Matthew Seal edited comment on SPARK-4105 at 8/31/16 9:42 PM:
--

Reproducible on 1.6.1 from a count() call after cache(df.repartition(2000)) 
against a dataset built from a PySpark-generated RDD, once I grew it to GB 
sizes (a few million rows). Smaller dataset sizes have consistently not 
triggered this issue.
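A rough sketch of the shape of the triggering job, assuming a multi-GB 
DataFrame `df` is already in scope (the original run was driven from PySpark, 
but the pattern is the same):
{code}
// `df` is assumed to be a multi-GB DataFrame loaded elsewhere.
// Repartitioning to 2000 partitions forces a large shuffle; the snappy
// decompression failure below surfaced while fetching that shuffle data.
val repartitioned = df.repartition(2000).cache()
repartitioned.count()
{code}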

{code}
java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2)
at 
org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:361)
at 
org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:158)
at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.spark-project.guava.io.ByteStreams.read(ByteStreams.java:899)
at 
org.spark-project.guava.io.ByteStreams.readFully(ByteStreams.java:733)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:119)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:102)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
at 
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)

[jira] [Updated] (SPARK-17061) Incorrect results returned following a join of two datasets and a map step where total number of columns >100

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17061:
---
Labels: correctness  (was: )

> Incorrect results returned following a join of two datasets and a map step 
> where total number of columns >100
> -
>
> Key: SPARK-17061
> URL: https://issues.apache.org/jira/browse/SPARK-17061
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Jamie Hutton
>Assignee: Liwei Lin
>Priority: Critical
>  Labels: correctness
> Fix For: 2.0.1, 2.1.0
>
>
> We have hit a consistent bug where we have a dataset with more than 100 
> columns. I am raising this as a blocker because Spark is returning the WRONG 
> results rather than erroring, leading to data integrity issues.
> I have put together the following test case which will show the issue (it 
> will run in spark-shell). In this example I am joining a dataset with lots of 
> fields onto another dataset. 
> The join works fine, and if you show the dataset you will get the expected 
> result. However, if you run a map step over the dataset you end up with a 
> strange error where the sequence that is in the right dataset now only 
> contains the last value.
> Whilst this test may seem a rather contrived example, what we are doing here 
> is a very standard analytical pattern. My original code was designed to:
>  - take a dataset of child records
>  - groupByKey up to the parent: giving a Dataset of (ParentID, Seq[Children])
>  - join the children onto the parent by parentID: giving 
> ((Parent), (ParentID, Seq[Children]))
>  - map over the result to give a tuple of (Parent, Seq[Children])
> Notes:
> - The issue is resolved by having fewer fields - as soon as we go <= 100, the 
> integrity issue goes away. Try removing one of the fields from BigCaseClass 
> below.
> - The issue arises based on the total number of fields in the resulting 
> dataset. Below I have a small case class and a big case class, but two case 
> classes of 50 variables would give the same issue.
> - The issue occurs where the case class being joined on (on the right) has a 
> case class type. It doesn't occur if you have a Seq[String].
> - If I go back to an RDD for the map step after the join I can work around the 
> issue, but I lose all the benefits of datasets.
> Scala code test case:
>   case class Name(name: String)
>   case class SmallCaseClass (joinkey: Integer, names: Seq[Name])
>   case class BigCaseClass  (field1: Integer,field2: Integer,field3: 
> Integer,field4: Integer,field5: Integer,field6: Integer,field7: 
> Integer,field8: Integer,field9: Integer,field10: Integer,field11: 
> Integer,field12: Integer,field13: Integer,field14: Integer,field15: 
> Integer,field16: Integer,field17: Integer,field18: Integer,field19: 
> Integer,field20: Integer,field21: Integer,field22: Integer,field23: 
> Integer,field24: Integer,field25: Integer,field26: Integer,field27: 
> Integer,field28: Integer,field29: Integer,field30: Integer,field31: 
> Integer,field32: Integer,field33: Integer,field34: Integer,field35: 
> Integer,field36: Integer,field37: Integer,field38: Integer,field39: 
> Integer,field40: Integer,field41: Integer,field42: Integer,field43: 
> Integer,field44: Integer,field45: Integer,field46: Integer,field47: 
> Integer,field48: Integer,field49: Integer,field50: Integer,field51: 
> Integer,field52: Integer,field53: Integer,field54: Integer,field55: 
> Integer,field56: Integer,field57: Integer,field58: Integer,field59: 
> Integer,field60: Integer,field61: Integer,field62: Integer,field63: 
> Integer,field64: Integer,field65: Integer,field66: Integer,field67: 
> Integer,field68: Integer,field69: Integer,field70: Integer,field71: 
> Integer,field72: Integer,field73: Integer,field74: Integer,field75: 
> Integer,field76: Integer,field77: Integer,field78: Integer,field79: 
> Integer,field80: Integer,field81: Integer,field82: Integer,field83: 
> Integer,field84: Integer,field85: Integer,field86: Integer,field87: 
> Integer,field88: Integer,field89: Integer,field90: Integer,field91: 
> Integer,field92: Integer,field93: Integer,field94: Integer,field95: 
> Integer,field96: Integer,field97: Integer,field98: Integer,field99: Integer)
>   
> val bigCC=Seq(BigCaseClass(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
> 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 
> 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 
> 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 
> 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 
> 91, 92, 93, 94, 95, 96, 97, 98, 99))
> 
> val 

[jira] [Updated] (SPARK-16721) Lead/lag needs to respect nulls

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-16721:
---
Labels: correctness  (was: )

> Lead/lag needs to respect nulls 
> 
>
> Key: SPARK-16721
> URL: https://issues.apache.org/jira/browse/SPARK-16721
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>  Labels: correctness
> Fix For: 2.0.1, 2.1.0
>
>
> It seems 2.0.0 changed the behavior of lead and lag to ignore nulls. This PR 
> changes the behavior back to 1.6's behavior, which respects nulls.
> For example 
> {code}
> SELECT
> b,
> lag(a, 1, 321) OVER (ORDER BY b) as lag,
> lead(a, 1, 321) OVER (ORDER BY b) as lead
> FROM (SELECT cast(null as int) as a, 1 as b
> UNION ALL
> select cast(null as int) as id, 2 as b) tmp
> {code}
> This query should return 
> {code}
> +---+++
> |  b| lag|lead|
> +---+++
> |  1| 321|null|
> |  2|null| 321|
> +---+++
> {code}
> instead of 
> {code}
> +---+---++
> |  b|lag|lead|
> +---+---++
> |  1|321| 321|
> |  2|321| 321|
> +---+---++
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17228) Not infer/propagate non-deterministic constraints

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17228:
---
Labels: correctness  (was: )

> Not infer/propagate non-deterministic constraints
> -
>
> Key: SPARK-17228
> URL: https://issues.apache.org/jira/browse/SPARK-17228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
>  Labels: correctness
> Fix For: 2.0.1, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16837) TimeWindow incorrectly drops slideDuration in constructors

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-16837:
---
Labels: correctness  (was: )

> TimeWindow incorrectly drops slideDuration in constructors
> --
>
> Key: SPARK-16837
> URL: https://issues.apache.org/jira/browse/SPARK-16837
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tom Magrino
>Assignee: Tom Magrino
>  Labels: correctness
> Fix For: 2.0.1, 2.1.0
>
>
> Right now, the constructors for the TimeWindow expression in Catalyst 
> incorrectly use the windowDuration in place of the slideDuration.  This will 
> cause incorrect windowing semantics after time window expressions are 
> analyzed by Catalyst.
> Relevant code is here: 
> https://github.com/apache/spark/blob/branch-2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TimeWindow.scala#L29-L54
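A simplified sketch of the kind of bug described (illustrative only, not the 
actual Catalyst source): a secondary constructor that forwards windowDuration 
where slideDuration was intended, silently dropping the slide interval.
{code}
// Sketch: TimeWindowLike is a stand-in, not Spark's TimeWindow expression.
case class TimeWindowLike(windowDuration: Long, slideDuration: Long, startTime: Long) {
  def this(windowDuration: Long, slideDuration: Long) =
    this(windowDuration, windowDuration, 0L)   // BUG: slideDuration is ignored
}

val w = new TimeWindowLike(10L, 5L)
// w.slideDuration is 10, not the requested 5 -- the window degenerates to tumbling
{code}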



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17244) Joins should not pushdown non-deterministic conditions

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17244:
---
Labels: correctness  (was: )

> Joins should not pushdown non-deterministic conditions
> --
>
> Key: SPARK-17244
> URL: https://issues.apache.org/jira/browse/SPARK-17244
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
>  Labels: correctness
> Fix For: 2.0.1, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16994) Filter and limit are illegally permuted.

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-16994:
---
Labels: correctness  (was: )

> Filter and limit are illegally permuted.
> 
>
> Key: SPARK-16994
> URL: https://issues.apache.org/jira/browse/SPARK-16994
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: TobiasP
>Assignee: Reynold Xin
>  Labels: correctness
> Fix For: 2.0.1, 2.1.0
>
>
> {noformat}
> scala> spark.createDataset(1 to 100).limit(10).filter($"value" % 10 === 
> 0).explain
> == Physical Plan ==
> CollectLimit 10
> +- *Filter ((value#875 % 10) = 0)
>+- LocalTableScan [value#875]
> scala> spark.createDataset(1 to 100).limit(10).filter($"value" % 10 === 
> 0).collect
> res23: Array[Int] = Array(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
> {noformat}
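A worked check of the expected semantics in plain Scala collections (no Spark 
needed): applying the limit before the filter keeps only the first ten values, 
of which only 10 is divisible by 10.
{code}
// limit(10) should take the first ten values; the filter then runs on those.
val expected = (1 to 100).take(10).filter(_ % 10 == 0)
// only the value 10 survives -- not 10, 20, ..., 100 as shown above
{code}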



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17349) Update testthat package on Jenkins

2016-08-31 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-17349:
--
Assignee: shane knapp

> Update testthat package on Jenkins
> --
>
> Key: SPARK-17349
> URL: https://issues.apache.org/jira/browse/SPARK-17349
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SparkR
>Reporter: Shivaram Venkataraman
>Assignee: shane knapp
>Priority: Minor
>
> As per https://github.com/apache/spark/pull/14889#issuecomment-243697097 
> using version 1.0 of testthat will improve the messages printed at the end of 
> a test to include skipped tests etc.
> The current package version on Jenkins is 0.11.0 and we can upgrade this to 
> 1.0.0 if there are no conflicts etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12586) Wrong answer with registerTempTable and union sql query

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-12586:
---
Description: 
The following sequence of sql(), registerTempTable() calls gets the wrong 
answer.
The correct answer is returned if the temp table is rewritten?

{code}
sql_text = """select row, col, foo, bar, value2 value
from (select row, col, foo, bar, 8 value2 from t0 where row=1 
and col=2) s1
  union select row, col, foo, bar, value from t0 where not 
(row=1 and col=2)"""
df2 = sqlContext.sql(sql_text)
df2.registerTempTable("t1")

# # The following 2 line workaround fixes the problem somehow?
# df3 = sqlContext.createDataFrame(df2.collect())
# df3.registerTempTable("t1")

# # The following 4 line workaround fixes the problem too..but takes way 
longer
# filename = "t1.json"
# df2.write.json(filename, mode='overwrite')
# df3 = sqlContext.read.json(filename)
# df3.registerTempTable("t1")

sql_text2 = """select row, col, v1 value from
(select v1 from
(select v_value v1 from values) s1
  left join
(select value v2,foo,bar,row,col from t1
  where foo=1
and bar=2 and value is not null) s2
  on v1=v2 where v2 is null
) sa join
(select row, col from t1 where foo=1
and bar=2 and value is null) sb"""
result = sqlContext.sql(sql_text2)
result.show()

# Expected result
# +---+---+-+
# |row|col|value|
# +---+---+-+
# |  3|  4|1|
# |  3|  4|2|
# |  3|  4|3|
# |  3|  4|4|
# +---+---+-+

# Getting this wrong result...when not using the workarounds above
# +---+---+-+
# |row|col|value|
# +---+---+-+
# +---+---+-+
{code}

  was:
The following sequence of sql(), registerTempTable() calls gets the wrong 
answer.
The correct answer is returned if the temp table is rewritten?

sql_text = """select row, col, foo, bar, value2 value
from (select row, col, foo, bar, 8 value2 from t0 where row=1 
and col=2) s1
  union select row, col, foo, bar, value from t0 where not 
(row=1 and col=2)"""
df2 = sqlContext.sql(sql_text)
df2.registerTempTable("t1")

# # The following 2 line workaround fixes the problem somehow?
# df3 = sqlContext.createDataFrame(df2.collect())
# df3.registerTempTable("t1")

# # The following 4 line workaround fixes the problem too..but takes way 
longer
# filename = "t1.json"
# df2.write.json(filename, mode='overwrite')
# df3 = sqlContext.read.json(filename)
# df3.registerTempTable("t1")

sql_text2 = """select row, col, v1 value from
(select v1 from
(select v_value v1 from values) s1
  left join
(select value v2,foo,bar,row,col from t1
  where foo=1
and bar=2 and value is not null) s2
  on v1=v2 where v2 is null
) sa join
(select row, col from t1 where foo=1
and bar=2 and value is null) sb"""
result = sqlContext.sql(sql_text2)
result.show()

# Expected result
# +---+---+-+
# |row|col|value|
# +---+---+-+
# |  3|  4|1|
# |  3|  4|2|
# |  3|  4|3|
# |  3|  4|4|
# +---+---+-+

# Getting this wrong result...when not using the workarounds above
# +---+---+-+
# |row|col|value|
# +---+---+-+
# +---+---+-+



> Wrong answer with registerTempTable and union sql query
> ---
>
> Key: SPARK-12586
> URL: https://issues.apache.org/jira/browse/SPARK-12586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: Windows 7
>Reporter: shao lo
> Attachments: sql_bug.py
>
>
> The following sequence of sql(), registerTempTable() calls gets the wrong 
> answer.
> The correct answer is returned if the temp table is rewritten?
> {code}
> sql_text = """select row, col, foo, bar, value2 value
> from (select row, col, foo, bar, 8 value2 from t0 where row=1 
> and col=2) s1
>   union select row, col, foo, bar, value from t0 where 
> not (row=1 and col=2)"""
> df2 = sqlContext.sql(sql_text)
> df2.registerTempTable("t1")
> # # The following 2 line workaround fixes the problem somehow?
> # df3 = sqlContext.createDataFrame(df2.collect())
> # df3.registerTempTable("t1")
> # # The following 4 line workaround fixes the problem too..but takes way 
> longer
> # filename = "t1.json"
> # df2.write.json(filename, 

[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-12030:
---
Labels: correctness  (was: )

> Incorrect results when aggregate joined data
> 
>
> Key: SPARK-12030
> URL: https://issues.apache.org/jira/browse/SPARK-12030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Assignee: Nong Li
>Priority: Blocker
>  Labels: correctness
> Fix For: 1.5.3, 1.6.0
>
>
> I have following issue.
> I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2)
> {code}
> t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache()
> t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache()
> joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
> {code}
> Important: both table are cached, so results should be the same on every 
> query.
> Then I did some counts:
> {code}
> t1.count() -> 5900729
> t1.registerTempTable("t1")
> sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
> t2.count() -> 54298
> joined.count() -> 5900729
> {code}
> And here the magic begins - I counted distinct id1 from the joined table
> {code}
> joined.registerTempTable("joined")
> sqlCtx.sql("select distinct(id1) from joined").count()
> {code}
> Results vary *(they are different on every run)* between 5899000 and 
> 590 but are never equal to 5900729.
> In addition. I did more queries:
> {code}
> sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 
> 1").collect() 
> {code}
> This gives some results but this query returns *1*
> {code}
> len(sqlCtx.sql("select * from joined where id1 = result").collect())
> {code}
> What's wrong ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11883) New Parquet reader generate wrong result

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11883:
---
Labels: correctness  (was: )

> New Parquet reader generate wrong result
> 
>
> Key: SPARK-11883
> URL: https://issues.apache.org/jira/browse/SPARK-11883
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Davies Liu
>Assignee: Nong Li
>Priority: Blocker
>  Labels: correctness
> Fix For: 1.6.0
>
>
> TPC-DS query Q98 generate wrong result with new parquet reader, works if we 
> turn it off.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11949) Query on DataFrame from cube gives wrong results

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11949:
---
Labels: correctness dataframe sql  (was: dataframe sql)

> Query on DataFrame from cube gives wrong results
> 
>
> Key: SPARK-11949
> URL: https://issues.apache.org/jira/browse/SPARK-11949
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Veli Kerim Celik
>Assignee: Liang-Chi Hsieh
>  Labels: correctness, dataframe, sql
> Fix For: 1.6.0
>
>
> {code:title=Reproduce bug|borderStyle=solid}
> case class fact(date: Int, hour: Int, minute: Int, room_name: String, temp: 
> Double)
> val df0 = sc.parallelize(Seq
> (
> fact(20151123, 18, 35, "room1", 18.6),
> fact(20151123, 18, 35, "room2", 22.4),
> fact(20151123, 18, 36, "room1", 17.4),
> fact(20151123, 18, 36, "room2", 25.6)
> )).toDF()
> val cube0 = df0.cube("date", "hour", "minute", "room_name").agg(Map
> (
> "temp" -> "avg"
> ))
> cube0.where("date IS NULL").show()
> {code}
> The query result is empty. It should not be, because cube0 contains the value 
> null several times in column 'date'. The issue arises because the cube 
> function reuses the schema information from df0. If I change the type of 
> parameters in the case class to Option[T] the query gives correct results.
> Solution: The cube function should change the schema by changing the nullable 
> property to true, for the columns (dimensions) specified in the method call 
> parameters.
> I am new to Scala and Spark. I don't know how to implement this. Somebody 
> please do.
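A minimal sketch of the workaround the reporter describes (making the dimension 
columns nullable before calling cube), assuming Spark 1.5's DataFrame/SQLContext 
API; whether this sidesteps the bug on 1.5.1 is untested here, and it is not 
the fix that went into Spark.
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructField, StructType}

// Rebuild the DataFrame with the listed columns marked nullable.
def withNullableColumns(df: DataFrame, cols: Set[String]): DataFrame = {
  val patched = StructType(df.schema.map {
    case StructField(name, dataType, _, metadata) if cols.contains(name) =>
      StructField(name, dataType, nullable = true, metadata)
    case other => other
  })
  df.sqlContext.createDataFrame(df.rdd, patched)
}

// Usage with the example above (hypothetical):
// val cube0 = withNullableColumns(df0, Set("date", "hour", "minute", "room_name"))
//   .cube("date", "hour", "minute", "room_name").agg(Map("temp" -> "avg"))
{code}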



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13221) GroupingSets Returns an Incorrect Results

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-13221:
---
Labels: correctness  (was: )

> GroupingSets Returns an Incorrect Results
> -
>
> Key: SPARK-13221
> URL: https://issues.apache.org/jira/browse/SPARK-13221
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
>  Labels: correctness
> Fix For: 2.0.0
>
>
> The following query returns a wrong result:
> {code}
> sql("select course, sum(earnings) as sum from courseSales group by course, 
> earnings" +
>  " grouping sets((), (course), (course, earnings))" +
>  " order by course, sum").show()
> {code}
> Before the fix, the results are like
> {code}
> [null,null]
> [Java,null]
> [Java,2.0]
> [Java,3.0]
> [dotNET,null]
> [dotNET,5000.0]
> [dotNET,1.0]
> [dotNET,48000.0]
> {code}
> After the fix, the results are corrected:
> {code}
> [null,113000.0]
> [Java,2.0]
> [Java,3.0]
> [Java,5.0]
> [dotNET,5000.0]
> [dotNET,1.0]
> [dotNET,48000.0]
> [dotNET,63000.0]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-08-31 Thread Ofir Manor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453412#comment-15453412
 ] 

Ofir Manor commented on SPARK-15406:


For me, structured streaming is currently all about real window operations 
based on event time (fields in the event), not processing time (already in 2.0 
with some limitations). In a future release it may also be about some new 
sink-related features (managing exactly-once from Spark to relational databases 
or HDFS, automatically doing upserts to databases).
So, I just want the same Kafka features as before - the value is the new 
processing capabilities; it just happens that my source of real-time events is 
Kafka, not Parquet files (as in 2.0).
I expect a couple of things. First, some basic config control: a pointer to 
Kafka (bootstrap servers), one or more topics, optionally an existing consumer 
group or an offset definition, and optionally a Kerberized connection. I also 
expect exactly-once processing from Kafka to Spark (including correctly 
recovering after a Spark node failure).
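A purely hypothetical sketch of the kind of configuration surface described 
above; no such source existed at the time, and every option name below is 
illustrative rather than an actual API:
{code}
// Hypothetical: what a structured-streaming Kafka source configuration might look like.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")   // pointer to Kafka
  .option("subscribe", "clickstream,transactions")                  // one or more topics
  .option("startingOffsets", "latest")                              // or an explicit offset definition
  .load()
{code}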

> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> Structured streaming doesn't have support for Kafka yet.  I personally feel 
> like time-based indexing would make for a much better interface, but it's 
> been pushed back to Kafka 0.10.1:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17349) Update testthat package on Jenkins

2016-08-31 Thread shane knapp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp resolved SPARK-17349.
-
Resolution: Fixed

done!

> Update testthat package on Jenkins
> --
>
> Key: SPARK-17349
> URL: https://issues.apache.org/jira/browse/SPARK-17349
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SparkR
>Reporter: Shivaram Venkataraman
>Priority: Minor
>
> As per https://github.com/apache/spark/pull/14889#issuecomment-243697097 
> using version 1.0 of testthat will improve the messages printed at the end of 
> a test to include skipped tests etc.
> The current package version on Jenkins is 0.11.0 and we can upgrade this to 
> 1.0.0 if there are no conflicts etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14234) Executor crashes for TaskRunner thread interruption

2016-08-31 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453414#comment-15453414
 ] 

Barry Becker commented on SPARK-14234:
--

Is it a lot of work to backport this fix to 1.6.3?
We have an app that requires it. We also require job-server, and that does not 
look like it will be supporting 2.0.0 anytime soon.


> Executor crashes for TaskRunner thread interruption
> ---
>
> Key: SPARK-14234
> URL: https://issues.apache.org/jira/browse/SPARK-14234
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Devaraj K
>Assignee: Devaraj K
> Fix For: 2.0.0
>
>
> If the TaskRunner thread gets interrupted while running due to a task kill or 
> any other reason, the interrupted thread will try to update the task status 
> as part of the exception handling and fails with the exception below. This 
> happens from the statusUpdate calls in all of these catch blocks; the 
> corresponding exceptions for each catch case are shown below.
> {code:title=Executor.scala|borderStyle=solid}
> case _: TaskKilledException | _: InterruptedException if task.killed 
> =>
>  ..
> case cDE: CommitDeniedException =>
>  ..
> case t: Throwable =>
>  ..
> {code}
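
As a generic Scala illustration (not Spark code) of the underlying mechanism, before the 
original stack traces: once a thread's interrupt status is set, a write through an 
interruptible NIO channel (the path taken by SerializableBuffer.writeObject below) fails 
with ClosedByInterruptException instead of completing the status update.

{code}
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer
import java.nio.channels.Channels

// A writable NIO channel; writes through it participate in the interruptible-channel
// protocol, like the channel used by SerializableBuffer.writeObject.
val channel = Channels.newChannel(new ByteArrayOutputStream())

val worker = new Thread(new Runnable {
  override def run(): Unit = {
    try {
      Thread.sleep(10000)                  // stand-in for the running task
    } catch {
      case _: InterruptedException =>
        Thread.currentThread().interrupt() // restore the interrupt status
        // The "status update" now runs on an interrupted thread, so the channel
        // write throws java.nio.channels.ClosedByInterruptException, which escapes
        // this catch block as an uncaught exception in the worker thread.
        channel.write(ByteBuffer.wrap("KILLED".getBytes("UTF-8")))
    }
  }
})

worker.start()
worker.interrupt()
worker.join()
{code}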
> {code:xml}
> 16/03/29 17:32:33 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-2,5,main]
> java.lang.Error: java.nio.channels.ClosedByInterruptException
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> Caused by: java.nio.channels.ClosedByInterruptException
>   at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>   at 
> java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:460)
>   at 
> org.apache.spark.util.SerializableBuffer$$anonfun$writeObject$1.apply(SerializableBuffer.scala:49)
>   at 
> org.apache.spark.util.SerializableBuffer$$anonfun$writeObject$1.apply(SerializableBuffer.scala:47)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>   at 
> org.apache.spark.util.SerializableBuffer.writeObject(SerializableBuffer.scala:47)
>   at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.serialize(NettyRpcEnv.scala:253)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:513)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.statusUpdate(CoarseGrainedExecutorBackend.scala:135)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   ... 2 more
> {code}
> {code:xml}
> 16/03/29 08:00:29 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-4,5,main]
> java.lang.Error: java.nio.channels.ClosedByInterruptException
>   at 
> 

[jira] [Commented] (SPARK-17349) Update testthat package on Jenkins

2016-08-31 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453397#comment-15453397
 ] 

Shivaram Venkataraman commented on SPARK-17349:
---

[~shaneknapp] Could we upgrade the testthat package on Jenkins? Something like

{code}
Rscript -e 'install.packages("testthat", repos="http://cran.stat.ucla.edu/")'
{code}

should do the trick, I think.

> Update testthat package on Jenkins
> --
>
> Key: SPARK-17349
> URL: https://issues.apache.org/jira/browse/SPARK-17349
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SparkR
>Reporter: Shivaram Venkataraman
>Priority: Minor
>
> As per https://github.com/apache/spark/pull/14889#issuecomment-243697097 
> using version 1.0 of testthat will improve the messages printed at the end of 
> a test to include skipped tests etc.
> The current package version on Jenkins is 0.11.0 and we can upgrade this to 
> 1.0.0 if there are no conflicts etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6851) Wrong answers for self joins of converted parquet relations

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-6851:
--
Labels: correctness  (was: )

> Wrong answers for self joins of converted parquet relations
> ---
>
> Key: SPARK-6851
> URL: https://issues.apache.org/jira/browse/SPARK-6851
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>  Labels: correctness
> Fix For: 1.3.1, 1.4.0
>
>
> From the user list (
> /cc [~chinnitv])  When the same relation exists twice in a query plan, our 
> new caching logic replaces both instances with identical replacements.  The 
> bug can be seen in the following transformation:
> {code}
> === Applying Rule 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions ===
> !Project [state#59,month#60]   
> 'Project [state#105,month#106]
> ! Join Inner, Some(((state#69 = state#59) && (month#70 = month#60)))'Join 
> Inner, Some(((state#105 = state#105) && (month#106 = month#106)))
> !  MetastoreRelation default, orders, None   
> Subquery orders
> !  Subquery ao
> Relation[id#97,category#98,make#99,type#100,price#101,pdate#102,customer#103,city#104,state#105,month#106]
>  org.apache.spark.sql.parquet.ParquetRelation2
> !   Distinct 
> Subquery ao
> !Project [state#69,month#70]  
> Distinct
> ! Join Inner, Some((id#81 = id#71))
> Project [state#105,month#106]
> !  MetastoreRelation default, orders, None  
> Join Inner, Some((id#115 = id#97))
> !  MetastoreRelation default, orderupdates, None 
> Subquery orders
> ! 
> Relation[id#97,category#98,make#99,type#100,price#101,pdate#102,customer#103,city#104,state#105,month#106]
>  org.apache.spark.sql.parquet.ParquetRelation2
> !
> Subquery orderupdates
> ! 
> Relation[id#115,category#116,make#117,type#118,price#119,pdate#120,customer#121,city#122,state#123,month#124]
>  org.apache.spark.sql.parquet.ParquetRelation2
> {code} 
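
For reference, a hypothetical sketch of the query shape suggested by the plan above; 
the original query text is not shown here, so the table and column names below are 
inferred from the plan and may not match it exactly:

{code}
// Sketch only (names inferred from the plan above, not the original query).
// The same metastore Parquet table (`orders`) appears twice, once directly and once
// inside the subquery, so the conversion rule must give each occurrence its own
// attribute IDs; otherwise the join condition degenerates into comparing columns
// with themselves, as in `state#105 = state#105` above.
val df = sqlContext.sql("""
  SELECT o.state, o.month
  FROM orders o
  JOIN (SELECT DISTINCT orders.state, orders.month
        FROM orders
        JOIN orderupdates ON orders.id = orderupdates.id) ao
    ON o.state = ao.state AND o.month = ao.month
""")
df.explain(true)
{code}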



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10169) Evaluating AggregateFunction1 (old code path) may return wrong answers when grouping expressions are used as arguments of aggregate functions

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10169:
---
Labels: correctness  (was: )

> Evaluating AggregateFunction1 (old code path) may return wrong answers when 
> grouping expressions are used as arguments of aggregate functions
> -
>
> Key: SPARK-10169
> URL: https://issues.apache.org/jira/browse/SPARK-10169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.1, 1.2.2, 1.3.1, 1.4.1
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>  Labels: correctness
> Fix For: 1.3.2, 1.4.2
>
>
> Before Spark 1.5, if an aggregate function uses a grouping expression as an 
> input argument, the result of the query can be wrong. The reason is that we 
> use transformUp when rewriting aggregate results (see 
> https://github.com/apache/spark/blob/branch-1.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L154).
>  
> To reproduce the problem, you can use
> {code}
> import org.apache.spark.sql.functions._
> sc.parallelize((1 to 1000), 50).map(i => 
> Tuple1(i)).toDF("i").registerTempTable("t")
> sqlContext.sql(""" 
> select i % 10, sum(if(i % 10 = 5, 1, 0)), count(i)
> from t
> where i % 10 = 5
> group by i % 10""").explain()
> == Physical Plan ==
> Aggregate false, [PartialGroup#234], [PartialGroup#234 AS 
> _c0#225,SUM(CAST(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf((PartialGroup#234
>  = 5),1,0), LongType)) AS _c1#226L,Coalesce(SUM(PartialCount#233L),0) AS 
> _c2#227L]
>  Exchange (HashPartitioning [PartialGroup#234], 200)
>   Aggregate true, [(i#191 % 10)], [(i#191 % 10) AS 
> PartialGroup#234,SUM(CAST(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf(((i#191
>  % 10) = 5),1,0), LongType)) AS PartialSum#232L,COUNT(1) AS PartialCount#233L]
>Project [_1#190 AS i#191]
> Filter ((_1#190 % 10) = 5)
>  PhysicalRDD [_1#190], MapPartitionsRDD[93] at mapPartitions at 
> ExistingRDD.scala:37
> sqlContext.sql(""" 
> select i % 10, sum(if(i % 10 = 5, 1, 0)), count(i)
> from t
> where i % 10 = 5
> group by i % 10""").show
> _c0 _c1 _c2
> 5   50  100
> {code}
> In Spark 1.5, the new aggregation code path does not have this problem. The 
> old code path is fixed by 
> https://github.com/apache/spark/commit/dd9ae7945ab65d353ed2b113e0c1a00a0533ffd6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7965) Wrong answers for queries with multiple window specs in the same expression

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-7965:
--
Labels: correctness  (was: )

> Wrong answers for queries with multiple window specs in the same expression
> ---
>
> Key: SPARK-7965
> URL: https://issues.apache.org/jira/browse/SPARK-7965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Josh Rosen
>Assignee: Yin Huai
>  Labels: correctness
> Fix For: 1.4.0
>
>
> I think that Spark SQL may be returning incorrect answers for queries that 
> use multiple window specifications within the same expression.  Here's an 
> example that illustrates the problem.
> Say that I have a table with a single numeric column and that I want to 
> compute a cumulative distribution function over this column.  Let's call this 
> table {{nums}}:
> {code}
> val nums = sc.parallelize(1 to 10).map(x => (x)).toDF("x")
> nums.registerTempTable("nums")
> {code}
> It's easy to compute a running sum over this column:
> {code}
> sqlContext.sql("""
> select sum(x) over (rows between unbounded preceding and current row) 
> from nums
> """).collect()
> nums: org.apache.spark.sql.DataFrame = [x: int]
> res29: Array[org.apache.spark.sql.Row] = Array([1], [3], [6], [10], [15], 
> [21], [28], [36], [45], [55])
> {code}
> It's also easy to compute a total sum over all rows:
> {code}
> sqlContext.sql("""
> select sum(x) over (rows between unbounded preceding and unbounded 
> following) from nums
> """).collect()
> res34: Array[org.apache.spark.sql.Row] = Array([55], [55], [55], [55], [55], 
> [55], [55], [55], [55], [55])
> {code}
> Let's say that I combine these expressions to compute a CDF:
> {code}
> sqlContext.sql("""
>   select (sum(x) over (rows between unbounded preceding and current row))
> /
> (sum(x) over (rows between unbounded preceding and unbounded following)) 
> from nums
> """).collect()
> res31: Array[org.apache.spark.sql.Row] = Array([1.0], [1.0], [1.0], [1.0], 
> [1.0], [1.0], [1.0], [1.0], [1.0], [1.0])
> {code}
> This seems wrong.  Note that if we combine the running total, global total, 
> and combined expression in the same query, then we see that the first two 
> values are computed correctly, but the combined expression seems to be 
> incorrect:
> {code}
> sqlContext.sql("""
> select
> sum(x) over (rows between unbounded preceding and current row) as 
> running_sum,
> (sum(x) over (rows between unbounded preceding and unbounded following)) 
> as total_sum,
> ((sum(x) over (rows between unbounded preceding and current row))
> /
> (sum(x) over (rows between unbounded preceding and unbounded following))) 
> as combined
> from nums 
> """).collect()
> res40: Array[org.apache.spark.sql.Row] = Array([1,55,1.0], [3,55,1.0], 
> [6,55,1.0], [10,55,1.0], [15,55,1.0], [21,55,1.0], [28,55,1.0], [36,55,1.0], 
> [45,55,1.0], [55,55,1.0])
> {code}
> /cc [~yhuai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15706) Wrong Answer when using IF NOT EXISTS in INSERT OVERWRITE for DYNAMIC PARTITION

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-15706:
---
Labels: correctness  (was: )

> Wrong Answer when using IF NOT EXISTS in INSERT OVERWRITE for DYNAMIC 
> PARTITION
> ---
>
> Key: SPARK-15706
> URL: https://issues.apache.org/jira/browse/SPARK-15706
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>  Labels: correctness
> Fix For: 2.0.0
>
>
> IF NOT EXISTS in INSERT OVERWRITE should not support dynamic partitions. 
> However, we return a wrong answer. 
> {noformat}
> CREATE TABLE table_with_partition(c1 string)
> PARTITIONED by (p1 string,p2 string)
> INSERT OVERWRITE TABLE table_with_partition
> partition (p1='a',p2='b')
> SELECT 'blarr2'
> INSERT OVERWRITE TABLE table_with_partition
> partition (p1='a',p2) IF NOT EXISTS
> SELECT 'blarr3'
> {noformat}
> {noformat}
> !== Correct Answer - 1 ==   == Spark Answer - 2 ==
>  [blarr2,a,b]   [blarr2,a,b]
> !   [blarr3,a,blarr3]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17093) Roundtrip encoding of array<struct<>> fields is wrong when whole-stage codegen is disabled

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17093:
---
Labels: correctness  (was: )

> Roundtrip encoding of array<struct<>> fields is wrong when whole-stage 
> codegen is disabled
> --
>
> Key: SPARK-17093
> URL: https://issues.apache.org/jira/browse/SPARK-17093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Liwei Lin
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.0.1, 2.1.0
>
>
> The following failing test demonstrates a bug where Spark mis-encodes 
> array-of-struct fields if whole-stage codegen is disabled:
> {code}
> withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "false") {
>   val data = Array(Array((1, 2), (3, 4)))
>   val ds = spark.sparkContext.parallelize(data).toDS()
>   assert(ds.collect() === data)
> }
> {code}
> When whole-stage codegen is enabled (the default), this works fine. When it's 
> disabled, as in the test above, Spark returns {{Array(Array((3,4), (3,4)))}}. 
> Because the last element of the array appears to be repeated, my best guess is 
> that the interpreted evaluation codepath forgot to {{copy()}} somewhere.
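
As a generic illustration (plain Scala, not Spark internals) of why a missing {{copy()}} 
on a reused mutable buffer makes every collected element look like the last one:

{code}
import scala.collection.mutable.ArrayBuffer

// One mutable buffer that is reused for every "row", as an interpreted
// evaluation path might do internally.
val reused = ArrayBuffer[Int]()
val collected = ArrayBuffer[Seq[Int]]()

for (row <- Seq(Seq(1, 2), Seq(3, 4))) {
  reused.clear()
  reused ++= row
  collected += reused           // bug: stores a reference to the shared buffer
  // collected += reused.toList // fix: copy before storing
}

// Both stored references point at the same buffer, which now holds (3, 4):
println(collected)              // ArrayBuffer(ArrayBuffer(3, 4), ArrayBuffer(3, 4))
{code}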



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17060) Call inner join after outer join will miss rows with null values

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17060:
---
Labels: correctness join  (was: join)

> Call inner join after outer join will miss rows with null values
> 
>
> Key: SPARK-17060
> URL: https://issues.apache.org/jira/browse/SPARK-17060
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0, Mac, Local
>Reporter: Linbo
>  Labels: correctness, join
>
> {code:title=test.scala|borderStyle=solid}
> scala> val df1 = sc.parallelize(Seq((1, 2, 3), (3, 3, 3))).toDF("a", "b", "c")
> df1: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> val df2 = sc.parallelize(Seq((1, 2, 4), (4, 4, 4))).toDF("a", "b", "d")
> df2: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> val df3 = df1.join(df2, Seq("a", "b"), "outer")
> df3: org.apache.spark.sql.DataFrame = [a: int, b: int ... 2 more fields]
> scala> df3.show()
> +---+---+----+----+
> |  a|  b|   c|   d|
> +---+---+----+----+
> |  1|  2|   3|   4|
> |  3|  3|   3|null|
> |  4|  4|null|   4|
> +---+---+----+----+
> scala> val df4 = sc.parallelize(Seq((1, 2, 5), (3, 3, 5), (4, 4, 
> 5))).toDF("a", "b", "e")
> df4: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> df4.show()
> +---+---+---+
> |  a|  b|  e|
> +---+---+---+
> |  1|  2|  5|
> |  3|  3|  5|
> |  4|  4|  5|
> +---+---+---+
> scala> df3.join(df4, Seq("a", "b"), "inner").show()
> +---+---+---+---+---+
> |  a|  b|  c|  d|  e|
> +---+---+---+---+---+
> |  1|  2|  3|  4|  5|
> +---+---+---+---+---+
> {code}
> If we call persist on df3, the output is correct
> {code:title=test2.scala|borderStyle=solid}
> scala> df3.persist
> res32: df3.type = [a: int, b: int ... 2 more fields]
> scala> df3.join(df4, Seq("a", "b"), "inner").show()
> +---+---+----+----+---+
> |  a|  b|   c|   d|  e|
> +---+---+----+----+---+
> |  1|  2|   3|   4|  5|
> |  3|  3|   3|null|  5|
> |  4|  4|null|   4|  5|
> +---+---+----+----+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16633) lag/lead using constant input values does not return the default value when the offset row does not exist

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-16633:
---
Labels: correctness  (was: )

> lag/lead using constant input values does not return the default value when 
> the offset row does not exist
> -
>
> Key: SPARK-16633
> URL: https://issues.apache.org/jira/browse/SPARK-16633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>  Labels: correctness
> Fix For: 2.0.1, 2.1.0
>
> Attachments: window_function_bug.html
>
>
> Please see the attached notebook. It seems lag/lead somehow fail to recognize 
> that an offset row does not exist and generate wrong results.
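
For reference (the attached notebook is not reproduced here), a minimal sketch of the 
pattern being described, using made-up data: lag/lead over a constant input with an 
explicit default, where the default should be returned whenever the offset row does 
not exist.

{code}
import spark.implicits._

val df = Seq((1, "a"), (2, "a"), (3, "b")).toDF("x", "grp")
df.createOrReplaceTempView("t")

// For the first (lag) and last (lead) row of each partition the offset row does not
// exist, so the default value 0 is the expected result; the issue described above is
// that constant inputs can make lag/lead miss this case.
spark.sql("""
  SELECT x, grp,
         lag(1, 1, 0)  OVER (PARTITION BY grp ORDER BY x) AS lag_const,
         lead(1, 1, 0) OVER (PARTITION BY grp ORDER BY x) AS lead_const
  FROM t
  ORDER BY grp, x
""").show()
{code}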



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17120) Analyzer incorrectly optimizes plan to empty LocalRelation

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17120:
---
Labels: correctness  (was: )

> Analyzer incorrectly optimizes plan to empty LocalRelation
> --
>
> Key: SPARK-17120
> URL: https://issues.apache.org/jira/browse/SPARK-17120
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Xiao Li
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.0.1, 2.1.0
>
>
> Consider the following query:
> {code}
> sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3")
> sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4")
> println(sql("""
>   SELECT
>   *
>   FROM (
>   SELECT
>   COALESCE(t2.int_col_1, t1.int_col_6) AS int_col
>   FROM table_3 t1
>   LEFT JOIN table_4 t2 ON false
>   ) t where (t.int_col) is not null
> """).collect().toSeq)
> {code}
> In the innermost query, the LEFT JOIN's condition is {{false}} but 
> nevertheless the number of rows produced should equal the number of rows in 
> {{table_3}} (which is non-empty). Since no values are {{null}}, the outer 
> {{where}} should retain all rows, so the overall result of this query should 
> contain a single row with the value '97'.
> Instead, the current Spark master (as of 
> 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking 
> at {{explain}}, it appears that the logical plan is being optimized to an 
> empty {{LocalRelation}}, so Spark doesn't even run the query. My suspicion 
> is that there's a bug in constraint propagation or filter pushdown.
> This issue doesn't seem to affect Spark 2.0, so I think it's a regression in 
> master. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16991) Full outer join followed by inner join produces wrong results

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-16991:
---
Labels: correctness  (was: )

> Full outer join followed by inner join produces wrong results
> -
>
> Key: SPARK-16991
> URL: https://issues.apache.org/jira/browse/SPARK-16991
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jonas Jarutis
>Assignee: Xiao Li
>Priority: Critical
>  Labels: correctness
> Fix For: 2.0.1, 2.1.0
>
>
> I found strange behaviour when using a full outer join in combination with an 
> inner join. It seems that the inner join can't match values correctly after a 
> full outer join. Here is a reproducible example in Spark 2.0.
> {code}
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
>       /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val a = Seq((1,2),(2,3)).toDF("a","b")
> a: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> val b = Seq((2,5),(3,4)).toDF("a","c")
> b: org.apache.spark.sql.DataFrame = [a: int, c: int]
> scala> val c = Seq((3,1)).toDF("a","d")
> c: org.apache.spark.sql.DataFrame = [a: int, d: int]
> scala> val ab = a.join(b, Seq("a"), "fullouter")
> ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> ab.show
> +---+----+----+
> |  a|   b|   c|
> +---+----+----+
> |  1|   2|null|
> |  3|null|   4|
> |  2|   3|   5|
> +---+----+----+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> +---+---+---+---+
> {code}
> Meanwhile, without the full outer, inner join works fine.
> {code}
> scala> b.join(c, "a").show
> +---+---+---+
> |  a|  c|  d|
> +---+---+---+
> |  3|  4|  1|
> +---+---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17099) Incorrect result when HAVING clause is added to group by query

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17099:
---
Labels: correctness  (was: )

> Incorrect result when HAVING clause is added to group by query
> --
>
> Key: SPARK-17099
> URL: https://issues.apache.org/jira/browse/SPARK-17099
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Xiao Li
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.0.1, 2.1.0
>
>
> Random query generation uncovered the following query which returns incorrect 
> results when run on Spark SQL. This wasn't the original query uncovered by 
> the generator, since I performed a bit of minimization to try to make it more 
> understandable.
> With the following tables:
> {code}
> val t1 = sc.parallelize(Seq(-234, 145, 367, 975, 298)).toDF("int_col_5")
> val t2 = sc.parallelize(
>   Seq(
> (-769, -244),
> (-800, -409),
> (940, 86),
> (-507, 304),
> (-367, 158))
> ).toDF("int_col_2", "int_col_5")
> t1.registerTempTable("t1")
> t2.registerTempTable("t2")
> {code}
> Run
> {code}
> SELECT
>   (SUM(COALESCE(t1.int_col_5, t2.int_col_2))),
>  ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
> FROM t1
> RIGHT JOIN t2
>   ON (t2.int_col_2) = (t1.int_col_5)
> GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
>  COALESCE(t1.int_col_5, t2.int_col_2)
> HAVING (SUM(COALESCE(t1.int_col_5, t2.int_col_2))) > ((COALESCE(t1.int_col_5, 
> t2.int_col_2)) * 2)
> {code}
> In Spark SQL, this returns an empty result set, whereas Postgres returns four 
> rows. However, if I omit the {{HAVING}} clause I can see the groups' rows, 
> which shows that they are being incorrectly filtered out by the {{HAVING}} clause:
> {code}
> +-------------------------------------+--------------------------------------+
> | sum(coalesce(int_col_5, int_col_2)) | (coalesce(int_col_5, int_col_2) * 2) |
> +-------------------------------------+--------------------------------------+
> | -507                                | -1014                                |
> | 940                                 | 1880                                 |
> | -769                                | -1538                                |
> | -367                                | -734                                 |
> | -800                                | -1600                                |
> +-------------------------------------+--------------------------------------+
> {code}
> Based on this, the output after adding the {{HAVING}} should contain four 
> rows, not zero.
> I'm not sure how to further shrink this in a straightforward way, so I'm 
> opening this bug to get help in triaging further.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17114) Adding a 'GROUP BY 1' where first column is literal results in wrong answer

2016-08-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17114:
---
Labels: correctness  (was: )

> Adding a 'GROUP BY 1' where first column is literal results in wrong answer
> ---
>
> Key: SPARK-17114
> URL: https://issues.apache.org/jira/browse/SPARK-17114
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Josh Rosen
>  Labels: correctness
>
> Consider the following example:
> {code}
> sc.parallelize(Seq(128, 256)).toDF("int_col").registerTempTable("mytable")
> // The following query should return an empty result set because the filter 
> condition is always false for this table.
> val withoutGroupBy = sqlContext.sql("""
>   SELECT 'foo'
>   FROM mytable
>   WHERE int_col == 0
> """)
> assert(withoutGroupBy.collect().isEmpty, "original query returned wrong 
> answer")
> // After adding a 'GROUP BY 1' the query result should still be empty because 
> we'd be grouping an empty table:
> val withGroupBy = sqlContext.sql("""
>   SELECT 'foo'
>   FROM mytable
>   WHERE int_col == 0
>   GROUP BY 1
> """)
> assert(withGroupBy.collect().isEmpty, "adding GROUP BY resulted in wrong 
> answer")
> {code}
> Here, this fails the second assertion by returning a single row. It appears 
> that running {{group by 1}} where column 1 is a constant causes filter 
> conditions to be ignored.
> Both PostgreSQL and SQLite return empty result sets for the query containing 
> the {{GROUP BY}}. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17326) Tests with HiveContext in SparkR being skipped always

2016-08-31 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-17326.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14889
[https://github.com/apache/spark/pull/14889]

> Tests with HiveContext in SparkR being skipped always
> -
>
> Key: SPARK-17326
> URL: https://issues.apache.org/jira/browse/SPARK-17326
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Reporter: Hyukjin Kwon
> Fix For: 2.0.1, 2.1.0
>
>
> Currently, HiveContext in SparkR is not being tested and is always skipped.
> This is because the initialization of {{TestHiveContext}} fails while trying 
> to load non-existent data paths (test tables).
> This was introduced by https://github.com/apache/spark/pull/14005



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


