[jira] [Commented] (SPARK-13101) Dataset complex types mapping to DataFrame (element nullability) mismatch

2016-02-01 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126932#comment-15126932
 ] 

Cheng Lian commented on SPARK-13101:


The reason why 1.6.0 allows this illegal situation is that we didn't do a 
nullability check there.

Another tricky thing here involves Parquet. When writing Parquet files, all 
non-nullable fields are intentionally converted to nullable fields for better 
interoperability with Hive. So in your case, after 
writing the {{Valuation}} records into a Parquet file and then reading them 
back, the {{valuations}} field becomes a nullable array. One possible 
workaround for your use case is to use {{Seq\[java.lang.Double\]}} for the 
{{valuations}} field.
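For illustration, here is a minimal sketch of that workaround (the case class 
name {{ValuationNullable}} is hypothetical; the fields follow the report below). 
Boxed {{java.lang.Double}} elements keep the array element type nullable, so the 
schema read back from Parquet still matches:
{code}
// Sketch only: boxed elements stay nullable, so the nullable array written by
// Parquet can still be mapped back onto this Dataset type.
import sqlContext.implicits._

case class ValuationNullable(tradeId: String,
                             counterparty: String,
                             nettingAgreement: String,
                             wrongWay: Boolean,
                             valuations: Seq[java.lang.Double], // nullable elements
                             timeInterval: Int,
                             jobId: String)

val ds = sqlContext.table("valuations").as[ValuationNullable]
{code}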

> Dataset complex types mapping to DataFrame  (element nullability) mismatch
> --
>
> Key: SPARK-13101
> URL: https://issues.apache.org/jira/browse/SPARK-13101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Deenar Toraskar
>Priority: Blocker
>
> There seems to be a regression between 1.6.0 and 1.6.1 (snapshot build). By 
> default a scala {{Seq\[Double\]}} is mapped by Spark as an ArrayType with 
> nullable element
> {noformat}
>  |-- valuations: array (nullable = true)
>  ||-- element: double (containsNull = true)
> {noformat}
> This could be read back to as a Dataset in Spark 1.6.0
> {code}
> val df = sqlContext.table("valuations").as[Valuation]
> {code}
> But with Spark 1.6.1 the same fails with
> {code}
> val df = sqlContext.table("valuations").as[Valuation]
> org.apache.spark.sql.AnalysisException: cannot resolve 'cast(valuations as 
> array)' due to data type mismatch: cannot cast 
> ArrayType(DoubleType,true) to ArrayType(DoubleType,false);
> {code}
> Here's the classes I am using
> {code}
> case class Valuation(tradeId : String,
>  counterparty: String,
>  nettingAgreement: String,
>  wrongWay: Boolean,
>  valuations : Seq[Double], /* one per scenario */
>  timeInterval: Int,
>  jobId: String)  /* used for hdfs partitioning */
> val vals : Seq[Valuation] = Seq()
> val valsDF = sqlContext.sparkContext.parallelize(vals).toDF
> valsDF.write.partitionBy("jobId").mode(SaveMode.Overwrite).saveAsTable("valuations")
> {code}
> even the following gives the same result
> {code}
> val valsDF = vals.toDS.toDF
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13101) Dataset complex types mapping to DataFrame (element nullability) mismatch

2016-02-01 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126894#comment-15126894
 ] 

Cheng Lian edited comment on SPARK-13101 at 2/1/16 7:45 PM:


[~deenar] I also tried the last two snippets you provided and they both work. 
Maybe the issue you hit has already been fixed in branch-1.6? The commit I was 
using is 
https://github.com/apache/spark/commit/96e32db5cbd1ef32f65206357bfb8d9f70a06d0a


was (Author: lian cheng):
[~deenar] I also tried the last two snippets and they both work. Maybe the 
issue you hit has already been fixed in branch-1.6? The commit I was using is 
https://github.com/apache/spark/commit/96e32db5cbd1ef32f65206357bfb8d9f70a06d0a

> Dataset complex types mapping to DataFrame  (element nullability) mismatch
> --
>
> Key: SPARK-13101
> URL: https://issues.apache.org/jira/browse/SPARK-13101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Deenar Toraskar
>Priority: Blocker
>
> There seems to be a regression between 1.6.0 and 1.6.1 (snapshot build). By 
> default a scala {{Seq\[Double\]}} is mapped by Spark as an ArrayType with 
> nullable element
> {noformat}
>  |-- valuations: array (nullable = true)
>  ||-- element: double (containsNull = true)
> {noformat}
> This could be read back to as a Dataset in Spark 1.6.0
> {code}
> val df = sqlContext.table("valuations").as[Valuation]
> {code}
> But with Spark 1.6.1 the same fails with
> {code}
> val df = sqlContext.table("valuations").as[Valuation]
> org.apache.spark.sql.AnalysisException: cannot resolve 'cast(valuations as 
> array)' due to data type mismatch: cannot cast 
> ArrayType(DoubleType,true) to ArrayType(DoubleType,false);
> {code}
> Here's the classes I am using
> {code}
> case class Valuation(tradeId : String,
>  counterparty: String,
>  nettingAgreement: String,
>  wrongWay: Boolean,
>  valuations : Seq[Double], /* one per scenario */
>  timeInterval: Int,
>  jobId: String)  /* used for hdfs partitioning */
> val vals : Seq[Valuation] = Seq()
> val valsDF = sqlContext.sparkContext.parallelize(vals).toDF
> valsDF.write.partitionBy("jobId").mode(SaveMode.Overwrite).saveAsTable("valuations")
> {code}
> even the following gives the same result
> {code}
> val valsDF = vals.toDS.toDF
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13101) Dataset complex types mapping to DataFrame (element nullability) mismatch

2016-02-01 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126890#comment-15126890
 ] 

Cheng Lian edited comment on SPARK-13101 at 2/1/16 7:55 PM:


I think this is expected behavior. The problem in the snippet mentioned in the 
JIRA description is that it tries to convert a nullable DataFrame field into a 
non-nullable Dataset field, which shouldn't be allowed. This can be further 
illustrated by the following spark-shell snippet:
{code}
// Converting non-nullable DF field to nullable DS field, OK
case class Rec(a: Array[Integer])
val df = Seq(Array(0)).map(Tuple1(_)).toDF("a")
df.printSchema
df.as[Rec]

// Converting nullable DF field to non-nullable DS field, failure
case class Rec(a: Array[Int])
val df = Seq(Array(0: Integer)).map(Tuple1(_)).toDF("a")
df.printSchema
df.as[Rec]
{code}


was (Author: lian cheng):
I think this is expected behavior. The problem in the snippet mentioned in the 
JIRA description is that it tries to convert a nullable DataFrame field into a 
non-nullable Dataset field, which shouldn't be allowed. This can be further 
illustrated by the following spark-shell snippet:
{code}
// Converting non-nullable DF field to nullable DS field, OK
case class Rec(a: Array[Integer])
val df = Seq(Array(0)).map(Tuple1(_)).toDF("a")
df.printSchema
df.as[Rec]

// Converting nullable DF field to non-nullable DS field, failure
case class Rec(a: Array[Integer])
val df = Seq(Array(0)).map(Tuple1(_)).toDF("a")
df.printSchema
df.as[Rec]
{code}

> Dataset complex types mapping to DataFrame  (element nullability) mismatch
> --
>
> Key: SPARK-13101
> URL: https://issues.apache.org/jira/browse/SPARK-13101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Deenar Toraskar
>Priority: Blocker
>
> There seems to be a regression between 1.6.0 and 1.6.1 (snapshot build). By 
> default a scala {{Seq\[Double\]}} is mapped by Spark as an ArrayType with 
> nullable element
> {noformat}
>  |-- valuations: array (nullable = true)
>  ||-- element: double (containsNull = true)
> {noformat}
> This could be read back to as a Dataset in Spark 1.6.0
> {code}
> val df = sqlContext.table("valuations").as[Valuation]
> {code}
> But with Spark 1.6.1 the same fails with
> {code}
> val df = sqlContext.table("valuations").as[Valuation]
> org.apache.spark.sql.AnalysisException: cannot resolve 'cast(valuations as 
> array)' due to data type mismatch: cannot cast 
> ArrayType(DoubleType,true) to ArrayType(DoubleType,false);
> {code}
> Here's the classes I am using
> {code}
> case class Valuation(tradeId : String,
>  counterparty: String,
>  nettingAgreement: String,
>  wrongWay: Boolean,
>  valuations : Seq[Double], /* one per scenario */
>  timeInterval: Int,
>  jobId: String)  /* used for hdfs partitioning */
> val vals : Seq[Valuation] = Seq()
> val valsDF = sqlContext.sparkContext.parallelize(vals).toDF
> valsDF.write.partitionBy("jobId").mode(SaveMode.Overwrite).saveAsTable("valuations")
> {code}
> even the following gives the same result
> {code}
> val valsDF = vals.toDS.toDF
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12718) SQL generation support for window functions

2016-02-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-12718:
--

Assignee: Xiao Li

> SQL generation support for window functions
> ---
>
> Key: SPARK-12718
> URL: https://issues.apache.org/jira/browse/SPARK-12718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>    Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can 
> be useful for bootstrapping test coverage. Please refer to SPARK-11012 for 
> more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (PARQUET-401) Deprecate Log and move to SLF4J Logger

2016-02-01 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127668#comment-15127668
 ] 

Cheng Lian commented on PARQUET-401:


Fixing this issue would be nice to have, but it probably shouldn't block 1.9.0.

> Deprecate Log and move to SLF4J Logger
> --
>
> Key: PARQUET-401
> URL: https://issues.apache.org/jira/browse/PARQUET-401
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.1
>Reporter: Ryan Blue
>
> The current Log class is intended to allow swapping out logger back-ends, but 
> SLF4J already does this. It also doesn't expose as nice an API as SLF4J, 
> which can handle formatting to avoid the cost of building log messages that 
> won't be used. I think we should deprecate the org.apache.parquet.Log class 
> and move to using SLF4J directly, instead of wrapping SLF4J (PARQUET-305).
> This will require deprecating the current Log class and replacing the current 
> uses of it with SLF4J.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-6319) Should throw analysis exception when using binary type in groupby/join

2016-02-01 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126668#comment-15126668
 ] 

Cheng Lian commented on SPARK-6319:
---

One possible (though not necessarily the best) workaround is to use a UDF that 
maps your binary field to some other data type that can be used in GROUP BY/JOIN.
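A rough sketch of what such a workaround could look like ({{df}} and {{binCol}} 
are hypothetical; this maps the binary column to a hex string before grouping):
{code}
import sqlContext.implicits._
import org.apache.spark.sql.functions.udf

// Hypothetical example: map the binary column to a hex string so that rows are
// grouped by value rather than by array reference.
val binToHex = udf { (bytes: Array[Byte]) =>
  if (bytes == null) null else bytes.map("%02x".format(_)).mkString
}

df.groupBy(binToHex($"binCol")).count().show()
{code}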

> Should throw analysis exception when using binary type in groupby/join
> --
>
> Key: SPARK-6319
> URL: https://issues.apache.org/jira/browse/SPARK-6319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
>    Reporter: Cheng Lian
>Assignee: Liang-Chi Hsieh
>Priority: Critical
> Fix For: 1.5.0
>
>
> Spark shell session for reproduction:
> {noformat}
> scala> import sqlContext.implicits._
> scala> import org.apache.spark.sql.types._
> scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" 
> cast BinaryType).distinct.show()
> ...
> CAST(c, BinaryType)
> [B@43f13160
> [B@5018b648
> [B@3be22500
> [B@476fc8a1
> {noformat}
> Spark SQL uses plain byte arrays to represent binary values. However, arrays 
> are compared by reference rather than by value. On the other hand, the 
> DISTINCT operator uses a {{HashSet}} and its {{.contains}} method to check 
> for duplicated values. These two facts together cause the problem.
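A quick, plain-Scala illustration of the reference-vs-value comparison described 
above (nothing here is Spark-specific):
{code}
// Two byte arrays with identical contents are still different objects, so
// reference-based comparison treats them as distinct values.
val a = "1".getBytes
val b = "1".getBytes
a == b            // false -- compared by reference
a.sameElements(b) // true  -- compared by contents
{code}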



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (PARQUET-495) Fix mismatches in Types class comments

2016-02-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved PARQUET-495.

Resolution: Fixed

Issue resolved by pull request 317
[https://github.com/apache/parquet-mr/pull/317]

> Fix mismatches in Types class comments
> --
>
> Key: PARQUET-495
> URL: https://issues.apache.org/jira/browse/PARQUET-495
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.0, 1.8.1
>Reporter: Liwei Lin
>Assignee: Liwei Lin
>Priority: Trivial
> Fix For: 1.9.0
>
>
> To produce:
> required group User \{
> required int64 id;
> *optional* binary email (UTF8);
> \}
> we should do:
> Types.requiredGroup()
>   .required(INT64).named("id")
>   .-*required* (BINARY).as(UTF8).named("email")-
>   .*optional* (BINARY).as(UTF8).named("email")
>   .named("User")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SPARK-13101) Dataset complex types mapping to DataFrame (element nullability) mismatch

2016-02-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-13101:
---
Description: 
There seems to be a regression between 1.6.0 and 1.6.1 (snapshot build). By 
default a Scala {{Seq\[Double\]}} is mapped by Spark to an ArrayType with 
nullable elements
{noformat}
 |-- valuations: array (nullable = true)
 ||-- element: double (containsNull = true)
{noformat}
This could be read back as a Dataset in Spark 1.6.0
{code}
val df = sqlContext.table("valuations").as[Valuation]
{code}
But with Spark 1.6.1 the same fails with
{code}
val df = sqlContext.table("valuations").as[Valuation]

org.apache.spark.sql.AnalysisException: cannot resolve 'cast(valuations as 
array)' due to data type mismatch: cannot cast 
ArrayType(DoubleType,true) to ArrayType(DoubleType,false);
{code}
Here are the classes I am using
{code}
case class Valuation(tradeId : String,
 counterparty: String,
 nettingAgreement: String,
 wrongWay: Boolean,
 valuations : Seq[Double], /* one per scenario */
 timeInterval: Int,
 jobId: String)  /* used for hdfs partitioning */

val vals : Seq[Valuation] = Seq()
val valsDF = sqlContext.sparkContext.parallelize(vals).toDF
valsDF.write.partitionBy("jobId").mode(SaveMode.Overwrite).saveAsTable("valuations")
{code}
Even the following gives the same result
{code}
val valsDF = vals.toDS.toDF
{code}


  was:
There seems to be a regression between 1.6.0 and 1.6.1 (snapshot build). By 
default a scala Seq[Double] is mapped by Spark as an ArrayType with nullable 
element

 |-- valuations: array (nullable = true)
 ||-- element: double (containsNull = true)

This could be read back to as a Dataset in Spark 1.6.0

val df = sqlContext.table("valuations").as[Valuation]

But with Spark 1.6.1 the same fails with
val df = sqlContext.table("valuations").as[Valuation]

org.apache.spark.sql.AnalysisException: cannot resolve 'cast(valuations as 
array)' due to data type mismatch: cannot cast 
ArrayType(DoubleType,true) to ArrayType(DoubleType,false);

Here's the classes I am using

case class Valuation(tradeId : String,
 counterparty: String,
 nettingAgreement: String,
 wrongWay: Boolean,
 valuations : Seq[Double], /* one per scenario */
 timeInterval: Int,
 jobId: String)  /* used for hdfs partitioning */

val vals : Seq[Valuation] = Seq()
val valsDF = sqlContext.sparkContext.parallelize(vals).toDF
valsDF.write.partitionBy("jobId").mode(SaveMode.Overwrite).saveAsTable("valuations")

even the following gives the same result
val valsDF = vals.toDS.toDF



> Dataset complex types mapping to DataFrame  (element nullability) mismatch
> --
>
> Key: SPARK-13101
> URL: https://issues.apache.org/jira/browse/SPARK-13101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Deenar Toraskar
>Priority: Blocker
>
> There seems to be a regression between 1.6.0 and 1.6.1 (snapshot build). By 
> default a scala {{Seq\[Double\]}} is mapped by Spark as an ArrayType with 
> nullable element
> {noformat}
>  |-- valuations: array (nullable = true)
>  ||-- element: double (containsNull = true)
> {noformat}
> This could be read back to as a Dataset in Spark 1.6.0
> {code}
> val df = sqlContext.table("valuations").as[Valuation]
> {code}
> But with Spark 1.6.1 the same fails with
> {code}
> val df = sqlContext.table("valuations").as[Valuation]
> org.apache.spark.sql.AnalysisException: cannot resolve 'cast(valuations as 
> array)' due to data type mismatch: cannot cast 
> ArrayType(DoubleType,true) to ArrayType(DoubleType,false);
> {code}
> Here's the classes I am using
> {code}
> case class Valuation(tradeId : String,
>  counterparty: String,
>  nettingAgreement: String,
>  wrongWay: Boolean,
>  valuations : Seq[Double], /* one per scenario */
>  timeInterval: Int,
>  jobId: String)  /* used for hdfs partitioning */
> val vals : Seq[Valuation] = Seq()
> val valsDF = sqlContext.sparkContext.parallelize(vals).toDF
> valsDF.write.partitionBy("jobId").mode(SaveMode.Overwrite).saveAsTable("valuations")
> {code}
> even the following gives the same result
> {code}
> val valsDF = vals.toDS.toDF
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12725) SQL generation suffers from name conficts introduced by some analysis rules

2016-01-31 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15125587#comment-15125587
 ] 

Cheng Lian commented on SPARK-12725:


There are other analysis rules that may use generated attributes (e.g., 
{{DistinctAggregationRewriter}}). I think a generic approach is better than 
special casing them one by one.

> SQL generation suffers from name conficts introduced by some analysis rules
> ---
>
> Key: SPARK-12725
> URL: https://issues.apache.org/jira/browse/SPARK-12725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Cheng Lian
>
> Some analysis rules generate auxiliary attribute references with the same 
> name but different expression IDs. For example, {{ResolveAggregateFunctions}} 
> introduces {{havingCondition}} and {{aggOrder}}, and 
> {{DistinctAggregationRewriter}} introduces {{gid}}.
> This is OK for normal query execution since these attribute references get 
> expression IDs. However, it's troublesome when converting resolved query 
> plans back to SQL query strings since expression IDs are erased.
> Here's an example Spark 1.6.0 snippet for illustration:
> {code}
> sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
> sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), 
> COUNT(b)").explain(true)
> {code}
> The above code produces the following resolved plan:
> {noformat}
> == Analyzed Logical Plan ==
> _c0: bigint
> Project [_c0#101L]
> +- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
>+- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) 
> AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS 
> aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
>   +- Subquery t
>  +- Project [id#46L AS a#47L,id#46L AS b#48L]
> +- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at 
> :26
> {noformat}
> Here we can see that both aggregate expressions in {{ORDER BY}} are extracted 
> into an {{Aggregate}} operator, and both of them are named {{aggOrder}} with 
> different expression IDs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12725) SQL generation suffers from name conficts introduced by some analysis rules

2016-01-31 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12725:
---
Description: 
Some analysis rules generate auxiliary attribute references with the same name 
but different expression IDs. For example, {{ResolveAggregateFunctions}} 
introduces {{havingCondition}} and {{aggOrder}}, and 
{{DistinctAggregationRewriter}} introduces {{gid}}.

This is OK for normal query execution since these attribute references get 
expression IDs. However, it's troublesome when converting resolved query plans 
back to SQL query strings since expression IDs are erased.

Here's an example Spark 1.6.0 snippet for illustration:
{code}
sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), 
COUNT(b)").explain(true)
{code}
The above code produces the following resolved plan:
{noformat}
== Analyzed Logical Plan ==
_c0: bigint
Project [_c0#101L]
+- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
   +- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) AS 
_c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS 
aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
  +- Subquery t
 +- Project [id#46L AS a#47L,id#46L AS b#48L]
+- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at 
:26
{noformat}
Here we can see that both aggregate expressions in {{ORDER BY}} are extracted 
into an {{Aggregate}} operator, and both of them are named {{aggOrder}} with 
different expression IDs.

  was:
Some analysis rules generate auxiliary attribute references with the same name 
but different expression IDs. For example, {{ResolveAggregateFunctions}} 
introduces {{havingCondition}} and {{aggOrder}}, and 
{{DistinctAggregationRewriter}} introduces {{gid}}.

This is OK for normal query execution since these attribute references get 
expression IDs. However, it's troublesome when converting resolved query plans 
back to SQL query strings since expression IDs are erased.


> SQL generation suffers from name conficts introduced by some analysis rules
> ---
>
> Key: SPARK-12725
> URL: https://issues.apache.org/jira/browse/SPARK-12725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Lian
>
> Some analysis rules generate auxiliary attribute references with the same 
> name but different expression IDs. For example, {{ResolveAggregateFunctions}} 
> introduces {{havingCondition}} and {{aggOrder}}, and 
> {{DistinctAggregationRewriter}} introduces {{gid}}.
> This is OK for normal query execution since these attribute references get 
> expression IDs. However, it's troublesome when converting resolved query 
> plans back to SQL query strings since expression IDs are erased.
> Here's an example Spark 1.6.0 snippet for illustration:
> {code}
> sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
> sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), 
> COUNT(b)").explain(true)
> {code}
> The above code produces the following resolved plan:
> {noformat}
> == Analyzed Logical Plan ==
> _c0: bigint
> Project [_c0#101L]
> +- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
>+- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) 
> AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS 
> aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
>   +- Subquery t
>  +- Project [id#46L AS a#47L,id#46L AS b#48L]
> +- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at 
> :26
> {noformat}
> Here we can see that both aggregate expressions in {{ORDER BY}} are extracted 
> into an {{Aggregate}} operator, and both of them are named {{aggOrder}} with 
> different expression IDs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12624) When schema is specified, we should give better error message if actual row length doesn't match

2016-01-29 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15124556#comment-15124556
 ] 

Cheng Lian commented on SPARK-12624:


Yes, it should.

> When schema is specified, we should give better error message if actual row 
> length doesn't match
> 
>
> Key: SPARK-12624
> URL: https://issues.apache.org/jira/browse/SPARK-12624
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Reynold Xin
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.6.1, 2.0.0
>
>
> The following code snippet reproduces this issue:
> {code}
> from pyspark.sql.types import StructType, StructField, IntegerType, StringType
> from pyspark.sql.types import Row
> schema = StructType([StructField("a", IntegerType()), StructField("b", 
> StringType())])
> rdd = sc.parallelize(range(10)).map(lambda x: Row(a=x))
> df = sqlContext.createDataFrame(rdd, schema)
> df.show()
> {code}
> An unintuitive {{ArrayIndexOutOfBoundsException}} exception is thrown in this 
> case:
> {code}
> ...
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
> at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.genericGet(rows.scala:227)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getAs(rows.scala:35)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.isNullAt(rows.scala:36)
> ...
> {code}
> We should give a better error message here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (PARQUET-432) Complete a todo for method ColumnDescriptor.compareTo()

2016-01-29 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved PARQUET-432.

Resolution: Fixed

Issue resolved by pull request 314
[https://github.com/apache/parquet-mr/pull/314]

> Complete a todo for method ColumnDescriptor.compareTo()
> ---
>
> Key: PARQUET-432
> URL: https://issues.apache.org/jira/browse/PARQUET-432
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.8.0, 1.8.1
>Reporter: Liwei Lin
>Assignee: Liwei Lin
>Priority: Minor
> Fix For: 1.9.0
>
>
> The ticket proposes to handle the case *path.length < o.path.length* in the 
> method ColumnDescriptor.compareTo().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (SPARK-13050) Scalatest tags fail builds with the addition of the sketch module

2016-01-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-13050.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10954
[https://github.com/apache/spark/pull/10954]

> Scalatest tags fail builds with the addition of the sketch module
> -
>
> Key: SPARK-13050
> URL: https://issues.apache.org/jira/browse/SPARK-13050
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Alex Bozarth
>Assignee: Alex Bozarth
> Fix For: 2.0.0
>
>
> Builds fail at the new sketch module when a scalatest tag is used. Found when 
> using "-Dtest.exclude.tags=org.apache.spark.tags.DockerTest" since Docker 
> isn't install-able on CentOS 6.
> {noformat}
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [  2.815 
> s]
> [INFO] Spark Project Sketch ... FAILURE [  1.148 
> s]
> [INFO] Spark Project Test Tags  SKIPPED
> [INFO] Spark Project Launcher . SKIPPED
> [INFO] Spark Project Networking ... SKIPPED
> [INFO] Spark Project Shuffle Streaming Service  SKIPPED
> [INFO] Spark Project Unsafe ... SKIPPED
> [INFO] Spark Project Core . SKIPPED
> [INFO] Spark Project GraphX ... SKIPPED
> [INFO] Spark Project Streaming  SKIPPED
> [INFO] Spark Project Catalyst . SKIPPED
> [INFO] Spark Project SQL .. SKIPPED
> [INFO] Spark Project ML Library ... SKIPPED
> [INFO] Spark Project Tools  SKIPPED
> [INFO] Spark Project Hive . SKIPPED
> [INFO] Spark Project Docker Integration Tests . SKIPPED
> [INFO] Spark Project REPL . SKIPPED
> [INFO] Spark Project YARN Shuffle Service . SKIPPED
> [INFO] Spark Project YARN . SKIPPED
> [INFO] Spark Project Hive Thrift Server ... SKIPPED
> [INFO] Spark Project Assembly . SKIPPED
> [INFO] Spark Project External Twitter . SKIPPED
> [INFO] Spark Project External Flume Sink .. SKIPPED
> [INFO] Spark Project External Flume ... SKIPPED
> [INFO] Spark Project External Flume Assembly .. SKIPPED
> [INFO] Spark Project External Akka  SKIPPED
> [INFO] Spark Project External MQTT  SKIPPED
> [INFO] Spark Project External MQTT Assembly ... SKIPPED
> [INFO] Spark Project External ZeroMQ .. SKIPPED
> [INFO] Spark Project External Kafka ... SKIPPED
> [INFO] Spark Project Examples . SKIPPED
> [INFO] Spark Project External Kafka Assembly .. SKIPPED
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 4.909 s
> [INFO] Finished at: 2016-01-27T12:40:12-08:00
> [INFO] Final Memory: 47M/456M
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (default-test) on 
> project spark-sketch_2.10: Execution default-test of goal 
> org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test failed: There was 
> an error in the forked process
> [ERROR] java.lang.RuntimeException: Unable to load category: 
> org.apache.spark.tags.DockerTest
> [ERROR] at 
> org.apache.maven.surefire.group.match.SingleGroupMatcher.loadGroupClasses(SingleGroupMatcher.java:157)
> [ERROR] at 
> org.apache.maven.surefire.common.junit48.FilterFactory.createGroupFilter(FilterFactory.java:93)
> [ERROR] at 
> org.apache.maven.surefire.junitcore.JUnitCoreProvider.createJUnit48Filter(JUnitCoreProvider.java:202)
> [ERROR] at 
> org.apache.maven.surefire.junitcore.JUnitCoreProvider.invoke(JUnitCoreProvider.java:119)
> [ERROR] at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBoote

[jira] [Commented] (SPARK-12725) SQL generation suffers from name conficts introduced by some analysis rules

2016-01-28 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15122013#comment-15122013
 ] 

Cheng Lian commented on SPARK-12725:


One possible solution I was thinking about is to add a new {{Attribute}} class 
named {{GeneratedAttributeRef}}, which is exactly the same as 
{{AttributeReference}} except that its {{sql}} representation includes the 
expression ID (e.g. {{gid_42}} instead of {{gid}}). To avoid code duplication, 
we can extract the common code into an abstract class, say {{AbstractAttributeRef}}.
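As a rough illustration only (hypothetical names and signatures, not actual 
Catalyst code), the idea would look something like:
{code}
// Hypothetical sketch: a generated attribute renders its SQL form with the
// expression ID appended, so identically named generated attributes no longer
// collide in generated SQL.
abstract class AbstractAttributeRef {
  def name: String
  def exprId: Long
  def sql: String = s"`$name`"
}

case class AttributeRef(name: String, exprId: Long) extends AbstractAttributeRef

case class GeneratedAttributeRef(name: String, exprId: Long) extends AbstractAttributeRef {
  // e.g. renders as `gid_42` instead of `gid`
  override def sql: String = s"`${name}_$exprId`"
}
{code}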

[~yhuai] [~rxin] [~marmbrus] What do you think? 

> SQL generation suffers from name conficts introduced by some analysis rules
> ---
>
> Key: SPARK-12725
> URL: https://issues.apache.org/jira/browse/SPARK-12725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Cheng Lian
>
> Some analysis rules generate auxiliary attribute references with the same 
> name but different expression IDs. For example, {{ResolveAggregateFunctions}} 
> introduces {{havingCondition}} and {{aggOrder}}, and 
> {{DistinctAggregationRewriter}} introduces {{gid}}.
> This is OK for normal query execution since these attribute references get 
> expression IDs. However, it's troublesome when converting resolved query 
> plans back to SQL query strings since expression IDs are erased.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12723) Comprehensive SQL generation support for expressions

2016-01-28 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15122102#comment-15122102
 ] 

Cheng Lian commented on SPARK-12723:


Most of this issue is probably fixed in SPARK-12799 and PR #10757. But we 
should do a thorough check.

> Comprehensive SQL generation support for expressions
> 
>
> Key: SPARK-12723
> URL: https://issues.apache.org/jira/browse/SPARK-12723
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>    Reporter: Cheng Lian
>
> Ensure that all built-in expressions can be mapped to their SQL representations 
> where they have one (e.g. ScalaUDF doesn't have a SQL representation).
> A (possibly incomplete) list of unsupported expressions is provided in PR 
> description of [PR #10541|https://github.com/apache/spark/pull/10541]:
> - Math expressions
> - String expressions
> - Null expressions
> - Calendar interval literal
> - Part of date time expressions
> - Complex type creators
> - Special NOT expressions, e.g. NOT LIKE and NOT IN



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12727) SQL generation support for distinct aggregation patterns that fit DistinctAggregationRewriter analysis rule

2016-01-28 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15122124#comment-15122124
 ] 

Cheng Lian commented on SPARK-12727:


It would be nice if we could recover the original distinct aggregation SQL 
query, but we can also call it done if we simply wrap all unrecognized patterns 
in subqueries. The generated SQL query string may be quite verbose that way, 
but the optimizer will do the hard work.
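As a purely hypothetical illustration (the table {{t}} and its columns are made 
up, and this is not actual generated output), wrapping the pattern in a subquery 
might look like:
{code}
// Hypothetical illustration of the "wrap unrecognized patterns in a subquery"
// idea: verbose but equivalent, and the optimizer can still plan it efficiently.
sqlContext.sql("""
  SELECT sub.c, COUNT(DISTINCT sub.a), SUM(sub.b)
  FROM (SELECT a, b, c FROM t) sub
  GROUP BY sub.c
""").explain(true)
{code}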

> SQL generation support for distinct aggregation patterns that fit 
> DistinctAggregationRewriter analysis rule
> ---
>
> Key: SPARK-12727
> URL: https://issues.apache.org/jira/browse/SPARK-12727
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11012) Canonicalize view definitions

2016-01-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-11012:
---
Description: 
In SPARK-10337, we added the first step of supporting views natively, which is 
basically wrapping the original view definition SQL text with an extra 
{{SELECT}} and then storing the wrapped SQL text in the metastore. This approach 
suffers from at least two issues:

# Switching current database may break view queries
# HiveQL doesn't allow CTE as subquery, thus CTE can't be used in view 
definition

To fix these issues, we need to canonicalize the view definition. For example, 
for a SQL string
{code:sql}
SELECT a, b FROM table
{code}
we will save this text to Hive metastore as
{code:sql}
SELECT `table`.`a`, `table`.`b` FROM `currentDB`.`table`
{code}

The core infrastructure of this work is SQL query string generation 
(SPARK-12593).  Namely, converting resolved logical query plans back to 
canonicalized SQL query strings. [PR 
#10541|https://github.com/apache/spark/pull/10541] set up basic infrastructure 
of SQL generation, but more language structures need to be supported.

[PR #10541|https://github.com/apache/spark/pull/10541] added round-trip testing 
infrastructure for SQL generation.  All queries tested by test suites extending 
{{HiveComparisonTest}} are executed in the following order:

# Parsing query string to logical plan
# Converting resolved logical plan back to canonicalized SQL query string
# Executing generated SQL query string
# Comparing query results with golden answers

Note that not every resolved logical query plan can be converted back to a SQL 
query string, either because it contains language structures that are not yet 
supported, or because it has no inherent SQL representation (e.g. query plans 
built on top of local Scala collections).

If a logical plan is inconvertible, {{HiveComparisonTest}} falls back to its 
original behavior, namely executing the original SQL query string and comparing 
the results with golden answers.

SQL generation details are logged and can be found in 
{{sql/hive/target/unit-tests.log}} (log level should be at least DEBUG).

  was:
In SPARK-10337, we added the first step of supporting view natively, which is 
basically wrapping the original view definition SQL text with an extra 
{{SELECT}} and then store the wrapped SQL text into metastore. This approach 
suffers at least two issues:

# Switching current database may break view queries
# HiveQL doesn't allow CTE as subquery, thus CTE can't be used in view 
definition

To fix these issues, we need to canonicalize the view definition. For example, 
for a SQL string
{code:sql}
SELECT a, b FROM table
{code}
we will save this text to Hive metastore as
{code:sql}
SELECT `table`.`a`, `table`.`b` FROM `currentDB`.`table`
{code}

The core infrastructure of this work is SQL query string generation 
(SPARK-12593).  Namely, converting resolved logical query plans back to 
canonicalized SQL query strings. [PR 
#10541|https://github.com/apache/spark/pull/10541] set up basic infrastructure 
of SQL generation, but more language structures need to be supported.


> Canonicalize view definitions
> -
>
> Key: SPARK-11012
> URL: https://issues.apache.org/jira/browse/SPARK-11012
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>
> In SPARK-10337, we added the first step of supporting view natively, which is 
> basically wrapping the original view definition SQL text with an extra 
> {{SELECT}} and then store the wrapped SQL text into metastore. This approach 
> suffers at least two issues:
> # Switching current database may break view queries
> # HiveQL doesn't allow CTE as subquery, thus CTE can't be used in view 
> definition
> To fix these issues, we need to canonicalize the view definition. For 
> example, for a SQL string
> {code:sql}
> SELECT a, b FROM table
> {code}
> we will save this text to Hive metastore as
> {code:sql}
> SELECT `table`.`a`, `table`.`b` FROM `currentDB`.`table`
> {code}
> The core infrastructure of this work is SQL query string generation 
> (SPARK-12593).  Namely, converting resolved logical query plans back to 
> canonicalized SQL query strings. [PR 
> #10541|https://github.com/apache/spark/pull/10541] set up basic 
> infrastructure of SQL generation, but more language structures need to be 
> supported.
> [PR #10541|https://github.com/apache/spark/pull/10541] added round-trip 
> testing infrastructure for SQL generation.  All queries tested by test suites 
> extending {{HiveComparisonTest}} are executed in the following order:
> # Parsing query string to logical plan
> # Converting resolved logica

[jira] [Updated] (SPARK-12719) SQL generation support for generators (including UDTF)

2016-01-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12719:
---
Description: 
{{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
Please refer to SPARK-11012 for more details.


> SQL generation support for generators (including UDTF)
> --
>
> Key: SPARK-12719
> URL: https://issues.apache.org/jira/browse/SPARK-12719
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>    Reporter: Cheng Lian
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12721) SQL generation support for script transformation

2016-01-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12721:
---
Description: 
{{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
Please refer to SPARK-11012 for more details.


> SQL generation support for script transformation
> 
>
> Key: SPARK-12721
> URL: https://issues.apache.org/jira/browse/SPARK-12721
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>    Reporter: Cheng Lian
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12720) SQL generation support for cube, rollup, and grouping set

2016-01-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12720:
---
Description: 
{{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
Please refer to SPARK-11012 for more details.


> SQL generation support for cube, rollup, and grouping set
> -
>
> Key: SPARK-12720
> URL: https://issues.apache.org/jira/browse/SPARK-12720
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>    Reporter: Cheng Lian
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12718) SQL generation support for window functions

2016-01-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12718:
---
Description: {{HiveWindowFunctionQuerySuite}} and 
{{HiveWindowFunctionQueryFileSuite}} can be useful for bootstrapping test 
coverage. Please refer to SPARK-11012 for more details.

> SQL generation support for window functions
> ---
>
> Key: SPARK-12718
> URL: https://issues.apache.org/jira/browse/SPARK-12718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>    Reporter: Cheng Lian
>
> {{HiveWindowFunctionQuerySuite}} and {{HiveWindowFunctionQueryFileSuite}} can 
> be useful for bootstrapping test coverage. Please refer to SPARK-11012 for 
> more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12401) Add support for enums in postgres

2016-01-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-12401.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10596
[https://github.com/apache/spark/pull/10596]

> Add support for enums in postgres
> -
>
> Key: SPARK-12401
> URL: https://issues.apache.org/jira/browse/SPARK-12401
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Jaka Jancar
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.0.0
>
>
> JSON and JSONB types [are now 
> converted|https://github.com/apache/spark/pull/8948/files] into strings on 
> the Spark side instead of throwing. It would be great if [enumerated 
> types|http://www.postgresql.org/docs/current/static/datatype-enum.html] were 
> treated similarly instead of failing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12401) Add support for enums in postgres

2016-01-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12401:
---
Assignee: Takeshi Yamamuro

> Add support for enums in postgres
> -
>
> Key: SPARK-12401
> URL: https://issues.apache.org/jira/browse/SPARK-12401
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Jaka Jancar
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.0.0
>
>
> JSON and JSONB types [are now 
> converted|https://github.com/apache/spark/pull/8948/files] into strings on 
> the Spark side instead of throwing. It would be great if [enumerated 
> types|http://www.postgresql.org/docs/current/static/datatype-enum.html] were 
> treated similarly instead of failing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13050) Scalatest tags fail builds with the addition of the sketch module

2016-01-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-13050:
---
Description: 
Builds fail at the new sketch module when a scalatest tag is used. Found when 
using "-Dtest.exclude.tags=org.apache.spark.tags.DockerTest" since Docker isn't 
install-able on CentOS 6.
{noformat}
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM ... SUCCESS [  2.815 s]
[INFO] Spark Project Sketch ... FAILURE [  1.148 s]
[INFO] Spark Project Test Tags  SKIPPED
[INFO] Spark Project Launcher . SKIPPED
[INFO] Spark Project Networking ... SKIPPED
[INFO] Spark Project Shuffle Streaming Service  SKIPPED
[INFO] Spark Project Unsafe ... SKIPPED
[INFO] Spark Project Core . SKIPPED
[INFO] Spark Project GraphX ... SKIPPED
[INFO] Spark Project Streaming  SKIPPED
[INFO] Spark Project Catalyst . SKIPPED
[INFO] Spark Project SQL .. SKIPPED
[INFO] Spark Project ML Library ... SKIPPED
[INFO] Spark Project Tools  SKIPPED
[INFO] Spark Project Hive . SKIPPED
[INFO] Spark Project Docker Integration Tests . SKIPPED
[INFO] Spark Project REPL . SKIPPED
[INFO] Spark Project YARN Shuffle Service . SKIPPED
[INFO] Spark Project YARN . SKIPPED
[INFO] Spark Project Hive Thrift Server ... SKIPPED
[INFO] Spark Project Assembly . SKIPPED
[INFO] Spark Project External Twitter . SKIPPED
[INFO] Spark Project External Flume Sink .. SKIPPED
[INFO] Spark Project External Flume ... SKIPPED
[INFO] Spark Project External Flume Assembly .. SKIPPED
[INFO] Spark Project External Akka  SKIPPED
[INFO] Spark Project External MQTT  SKIPPED
[INFO] Spark Project External MQTT Assembly ... SKIPPED
[INFO] Spark Project External ZeroMQ .. SKIPPED
[INFO] Spark Project External Kafka ... SKIPPED
[INFO] Spark Project Examples . SKIPPED
[INFO] Spark Project External Kafka Assembly .. SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 4.909 s
[INFO] Finished at: 2016-01-27T12:40:12-08:00
[INFO] Final Memory: 47M/456M
[INFO] 
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (default-test) on 
project spark-sketch_2.10: Execution default-test of goal 
org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test failed: There was an 
error in the forked process
[ERROR] java.lang.RuntimeException: Unable to load category: 
org.apache.spark.tags.DockerTest
[ERROR] at 
org.apache.maven.surefire.group.match.SingleGroupMatcher.loadGroupClasses(SingleGroupMatcher.java:157)
[ERROR] at 
org.apache.maven.surefire.common.junit48.FilterFactory.createGroupFilter(FilterFactory.java:93)
[ERROR] at 
org.apache.maven.surefire.junitcore.JUnitCoreProvider.createJUnit48Filter(JUnitCoreProvider.java:202)
[ERROR] at 
org.apache.maven.surefire.junitcore.JUnitCoreProvider.invoke(JUnitCoreProvider.java:119)
[ERROR] at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:203)
[ERROR] at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:155)
[ERROR] at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
[ERROR] Caused by: java.lang.ClassNotFoundException: 
org.apache.spark.tags.DockerTest
[ERROR] at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
[ERROR] at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
[ERROR] at java.security.AccessController.doPrivileged(Native Method)
[ERROR] at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
[ERROR] at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
[ERROR] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
[ERROR] at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
[ERROR] at 
org.apache.maven.surefire.group.match.SingleGroupMatcher.loadGroupClasses(SingleGroupMatcher.java:153)
[ERROR] ... 6 more
{noformat}

  was:
Builds fail at the new sketch module when a scalatest tag is used. Found w

[jira] [Updated] (SPARK-11955) Mark one side fields in merging schema for safely pushdowning filters in parquet

2016-01-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-11955:
---
Assignee: Liang-Chi Hsieh

> Mark one side fields in merging schema for safely pushdowning filters in 
> parquet
> 
>
> Key: SPARK-11955
> URL: https://issues.apache.org/jira/browse/SPARK-11955
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> Currently we simply skip pushing down filters to Parquet if schema merging is 
> enabled.
> However, we can actually mark the fields that exist on only one side of the 
> merged schema so that filters on them can still be pushed down safely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11955) Mark one side fields in merging schema for safely pushdowning filters in parquet

2016-01-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-11955.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 9940
[https://github.com/apache/spark/pull/9940]

> Mark one side fields in merging schema for safely pushdowning filters in 
> parquet
> 
>
> Key: SPARK-11955
> URL: https://issues.apache.org/jira/browse/SPARK-11955
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> Currently we simply skip pushing down filters to Parquet if schema merging is 
> enabled.
> However, we can actually mark the fields that exist on only one side of the 
> merged schema so that filters on them can still be pushed down safely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12725) SQL generation suffers from name conficts introduced by some analysis rules

2016-01-28 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15122043#comment-15122043
 ] 

Cheng Lian commented on SPARK-12725:


Thanks, this also sounds good to me. Will try this approach first.

> SQL generation suffers from name conficts introduced by some analysis rules
> ---
>
> Key: SPARK-12725
> URL: https://issues.apache.org/jira/browse/SPARK-12725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Cheng Lian
>
> Some analysis rules generate auxiliary attribute references with the same 
> name but different expression IDs. For example, {{ResolveAggregateFunctions}} 
> introduces {{havingCondition}} and {{aggOrder}}, and 
> {{DistinctAggregationRewriter}} introduces {{gid}}.
> This is OK for normal query execution since these attribute references get 
> expression IDs. However, it's troublesome when converting resolved query 
> plans back to SQL query strings since expression IDs are erased.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11012) Canonicalize view definitions

2016-01-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-11012:
---
Description: 
In SPARK-10337, we added the first step of supporting views natively, which is 
basically wrapping the original view definition SQL text with an extra 
{{SELECT}} and then storing the wrapped SQL text in the metastore. This approach 
suffers from at least two issues:

# Switching current database may break view queries
# HiveQL doesn't allow CTE as subquery, thus CTE can't be used in view 
definition

To fix these issues, we need to canonicalize the view definition. For example, 
for a SQL string
{code:sql}
SELECT a, b FROM table
{code}
we will save this text to Hive metastore as
{code:sql}
SELECT `table`.`a`, `table`.`b` FROM `currentDB`.`table`
{code}

The core infrastructure of this work is SQL query string generation 
(SPARK-12593).  Namely, converting resolved logical query plans back to 
canonicalized SQL query strings. [PR 
#10541|https://github.com/apache/spark/pull/10541] set up basic infrastructure 
of SQL generation, but more language structures need to be supported.

  was:In SPARK-10337, we added the first step of supporting view natively. 
Building on top of that work, we need to canonicalize the view definition. So, 
for a SQL string SELECT a, b FROM table, we will save this text to Hive 
metastore as SELECT `table`.`a`, `table`.`b` FROM `currentDB`.`table`. 


> Canonicalize view definitions
> -
>
> Key: SPARK-11012
> URL: https://issues.apache.org/jira/browse/SPARK-11012
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>
> In SPARK-10337, we added the first step of supporting view natively, which is 
> basically wrapping the original view definition SQL text with an extra 
> {{SELECT}} and then store the wrapped SQL text into metastore. This approach 
> suffers at least two issues:
> # Switching current database may break view queries
> # HiveQL doesn't allow CTE as subquery, thus CTE can't be used in view 
> definition
> To fix these issues, we need to canonicalize the view definition. For 
> example, for a SQL string
> {code:sql}
> SELECT a, b FROM table
> {code}
> we will save this text to Hive metastore as
> {code:sql}
> SELECT `table`.`a`, `table`.`b` FROM `currentDB`.`table`
> {code}
> The core infrastructure of this work is SQL query string generation 
> (SPARK-12593).  Namely, converting resolved logical query plans back to 
> canonicalized SQL query strings. [PR 
> #10541|https://github.com/apache/spark/pull/10541] set up basic 
> infrastructure of SQL generation, but more language structures need to be 
> supported.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13050) Scalatest tags fail builds with the addition of the sketch module

2016-01-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-13050:
---
Assignee: Alex Bozarth

> Scalatest tags fail builds with the addition of the sketch module
> -
>
> Key: SPARK-13050
> URL: https://issues.apache.org/jira/browse/SPARK-13050
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Alex Bozarth
>Assignee: Alex Bozarth
>
> Builds fail at the new sketch module when a scalatest tag is used. Found when 
> using "-Dtest.exclude.tags=org.apache.spark.tags.DockerTest", since Docker 
> isn't installable on CentOS 6.
> {noformat}
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [  2.815 
> s]
> [INFO] Spark Project Sketch ... FAILURE [  1.148 
> s]
> [INFO] Spark Project Test Tags  SKIPPED
> [INFO] Spark Project Launcher . SKIPPED
> [INFO] Spark Project Networking ... SKIPPED
> [INFO] Spark Project Shuffle Streaming Service  SKIPPED
> [INFO] Spark Project Unsafe ... SKIPPED
> [INFO] Spark Project Core . SKIPPED
> [INFO] Spark Project GraphX ... SKIPPED
> [INFO] Spark Project Streaming  SKIPPED
> [INFO] Spark Project Catalyst . SKIPPED
> [INFO] Spark Project SQL .. SKIPPED
> [INFO] Spark Project ML Library ... SKIPPED
> [INFO] Spark Project Tools  SKIPPED
> [INFO] Spark Project Hive . SKIPPED
> [INFO] Spark Project Docker Integration Tests . SKIPPED
> [INFO] Spark Project REPL . SKIPPED
> [INFO] Spark Project YARN Shuffle Service . SKIPPED
> [INFO] Spark Project YARN . SKIPPED
> [INFO] Spark Project Hive Thrift Server ... SKIPPED
> [INFO] Spark Project Assembly . SKIPPED
> [INFO] Spark Project External Twitter . SKIPPED
> [INFO] Spark Project External Flume Sink .. SKIPPED
> [INFO] Spark Project External Flume ... SKIPPED
> [INFO] Spark Project External Flume Assembly .. SKIPPED
> [INFO] Spark Project External Akka  SKIPPED
> [INFO] Spark Project External MQTT  SKIPPED
> [INFO] Spark Project External MQTT Assembly ... SKIPPED
> [INFO] Spark Project External ZeroMQ .. SKIPPED
> [INFO] Spark Project External Kafka ... SKIPPED
> [INFO] Spark Project Examples . SKIPPED
> [INFO] Spark Project External Kafka Assembly .. SKIPPED
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 4.909 s
> [INFO] Finished at: 2016-01-27T12:40:12-08:00
> [INFO] Final Memory: 47M/456M
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (default-test) on 
> project spark-sketch_2.10: Execution default-test of goal 
> org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test failed: There was 
> an error in the forked process
> [ERROR] java.lang.RuntimeException: Unable to load category: 
> org.apache.spark.tags.DockerTest
> [ERROR] at 
> org.apache.maven.surefire.group.match.SingleGroupMatcher.loadGroupClasses(SingleGroupMatcher.java:157)
> [ERROR] at 
> org.apache.maven.surefire.common.junit48.FilterFactory.createGroupFilter(FilterFactory.java:93)
> [ERROR] at 
> org.apache.maven.surefire.junitcore.JUnitCoreProvider.createJUnit48Filter(JUnitCoreProvider.java:202)
> [ERROR] at 
> org.apache.maven.surefire.junitcore.JUnitCoreProvider.invoke(JUnitCoreProvider.java:119)
> [ERROR] at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:203)
> [ERROR] at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:155)
> [ERROR] at 
> org.apache

[jira] [Created] (SPARK-13070) Points which physical file is the trouble maker when Parquet schema merging fails

2016-01-28 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-13070:
--

 Summary: Points which physical file is the trouble maker when 
Parquet schema merging fails
 Key: SPARK-13070
 URL: https://issues.apache.org/jira/browse/SPARK-13070
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor


As a user, I'd like to know which physical file is the trouble maker when 
Parquet schema merging fails. Currently, we only have an error message like 
this:
{quote}
Failed to merge incompatible data types LongType and IntegerType
{quote}
Would be nice to add the file path and the actual schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13070) Points out which physical file is the trouble maker when Parquet schema merging fails

2016-01-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-13070:
---
Summary: Points out which physical file is the trouble maker when Parquet 
schema merging fails  (was: Points which physical file is the trouble maker 
when Parquet schema merging fails)

> Points out which physical file is the trouble maker when Parquet schema 
> merging fails
> -
>
> Key: SPARK-13070
> URL: https://issues.apache.org/jira/browse/SPARK-13070
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> As a user, I'd like to know which physical file is the trouble maker when 
> Parquet schema merging fails. Currently, we only have an error message like 
> this:
> {quote}
> Failed to merge incompatible data types LongType and IntegerType
> {quote}
> Would be nice to add the file path and the actual schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Parquet for very wide table

2016-01-25 Thread Cheng Lian
Aside from Nong's comment, I think PARQUET-222, where we discussed a 
performance issue of writing wide tables, can be helpful.


Cheng

On 1/23/16 4:53 PM, Nong Li wrote:

I expect this to be difficult. This is roughly 3 orders of magnitude more
than even a typical wide table use case.

Answers inline.

On Thu, Jan 21, 2016 at 2:10 PM, Krishna  wrote:


We are considering using Parquet for storing a matrix that is dense and
very, very wide (can have more than 600K columns).

I have the following questions:

- Is there a limit on the # of columns in a Parquet file? We expect to
query [10-100] columns at a time using Spark - what are the performance
implications in this scenario?


There is no hard limit but I think you'll probably run into some issues.
There will probably be code paths that are not optimized for schemas this
big, but I expect those to be easier to address. The default configurations
will probably not work well (the metadata to data ratio would be bad). You
can try configuring very large row groups and see how that goes.
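
For illustration, a minimal sketch of bumping the row group size from Spark
(assuming the standard Parquet-Hadoop setting "parquet.block.size", which
controls the row group size in bytes, is the knob you want to raise):

  // purely illustrative; the path and sizes below are placeholders
  sc.hadoopConfiguration.setInt("parquet.block.size", 512 * 1024 * 1024) // 512 MB row groups
  df.write.parquet("/path/to/output")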



- We want a schema-less solution since the matrix can get wider over a
period of time
- Is there a way to generate such wide structured schema-less Parquet
files using map-reduce (input files are in custom binary format)?


No, Parquet requires a schema. The schema is flexible so you could map your
schema to a Parquet schema (each column could be binary, for example). Why
are you looking to use Parquet for this use case?
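
As an aside, a minimal sketch of what mapping such a matrix onto a Parquet
schema could look like on the Spark side (the column names, types, and the
rowRDD below are assumptions for illustration, not from this thread):

  import org.apache.spark.sql.types._

  // build a very wide schema programmatically instead of declaring it by hand
  val wideSchema = StructType((0 until 600000).map(i =>
    StructField(s"c$i", BinaryType, nullable = true)))
  val df = sqlContext.createDataFrame(rowRDD, wideSchema) // rowRDD: RDD[Row], assumed to exist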



- HBase can support millions of columns - anyone with prior experience
that compares Parquet vs HFile performance for wide structured tables?

- Does Impala have support for evolving schema?
Yes. Different systems have different rules on what is allowed, but the case
of appending a column to an existing schema should be well supported.


Krishna





Re: cast column string -> timestamp in Parquet file

2016-01-25 Thread Cheng Lian

The following snippet may help:

  import sqlContext.implicits._  // for the $"col" syntax
  import org.apache.spark.sql.types.TimestampType
  sqlContext.read.parquet(path).withColumn("col_ts", 
    $"col".cast(TimestampType)).drop("col")


Cheng

On 1/21/16 6:58 AM, Muthu Jayakumar wrote:
DataFrame and udf. This may be more performant than doing an RDD 
transformation, as you'll transform only the column that needs 
to be changed.
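
For example, a minimal UDF-based sketch (the column name "col" and the input
format accepted by java.sql.Timestamp.valueOf are assumptions here):

  import java.sql.Timestamp
  import org.apache.spark.sql.functions.udf

  // parse the string column into a proper timestamp, then drop the original
  val toTs = udf((s: String) => Timestamp.valueOf(s))
  val converted = df.withColumn("col_ts", toTs($"col")).drop("col") // $ needs sqlContext.implicits._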


Hope this helps.


On Thu, Jan 21, 2016 at 6:17 AM, Eli Super wrote:


Hi

I have a large parquet file.

I need to cast the whole column to timestamp format, then save it.

What is the right way to do it?

Thanks a lot






Re: Parquet for very wide table

2016-01-25 Thread Cheng Lian
PARQUET-222 is mostly a memory issue caused by the # of columns. On the 
write path, each column comes with write buffers, and they can 
accumulate to a large amount. In the case investigated in PARQUET-222, 
it took more than 10G to write a single row consisting of 26k integer 
columns. I.e., this issue is related to column count rather than row count.


But that was the situation as of Parquet 1.6. I haven't checked all the 
memory management improvements that happened recently, and haven't repeated 
the experiment using newer versions of Parquet yet.
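
(A rough, purely illustrative extrapolation from those numbers, assuming the 
buffering cost scales roughly linearly with column count: 10 GB over 26,000 
columns is on the order of 400 KB of write buffer per column, so 600,000 
columns would need something like 240 GB of buffer space per writing task, 
no matter how small each individual cell is.)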


Cheng

On 1/25/16 11:50 AM, Krishna wrote:

Thanks Cheng, Nong.

Data in the matrix is homogeneous (cells are booleans), so I don't expect
to face memory-related issues. Is the limitation on the # of columns, or
memory issues caused by the # of columns? To me it sounds more like memory
issues.

On Mon, Jan 25, 2016 at 10:16 AM, Cheng Lian <lian.cs@gmail.com> wrote:


Aside from Nong's comment, I think PARQUET-222, where we discussed a
performance issue of writing wide tables, can be helpful.

Cheng


On 1/23/16 4:53 PM, Nong Li wrote:


I expect this to be difficult. This is roughly 3 orders of magnitude more
than even a typical wide table use case.

Answers inline.

On Thu, Jan 21, 2016 at 2:10 PM, Krishna <research...@gmail.com> wrote:

We are considering using Parquet for storing a matrix that is dense and
very, very wide (can have more than 600K columns).

I have the following questions:

 - Is there a limit on the # of columns in a Parquet file? We expect to
   query [10-100] columns at a time using Spark - what are the performance
   implications in this scenario?

There is no hard limit but I think you'll probably run into some issues.
There will probably be code paths that are not optimized for schemas this
big, but I expect those to be easier to address. The default configurations
will probably not work well (the metadata to data ratio would be bad). You
can try configuring very large row groups and see how that goes.

 - We want a schema-less solution since the matrix can get wider over a
   period of time
 - Is there a way to generate such wide structured schema-less Parquet
   files using map-reduce (input files are in custom binary format)?

No, Parquet requires a schema. The schema is flexible so you could map your
schema to a Parquet schema (each column could be binary, for example). Why
are you looking to use Parquet for this use case?

 - HBase can support millions of columns - anyone with prior experience
   that compares Parquet vs HFile performance for wide structured tables?

 - Does Impala have support for evolving schema?

Yes. Different systems have different rules on what is allowed, but the case
of appending a column to an existing schema should be well supported.

Krishna






[jira] [Updated] (SPARK-12624) When schema is specified, we should give better error message if actual row length doesn't match

2016-01-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12624:
---
Summary: When schema is specified, we should give better error message if 
actual row length doesn't match  (was: When schema is specified, we should 
treat undeclared fields as null (in Python))

> When schema is specified, we should give better error message if actual row 
> length doesn't match
> 
>
> Key: SPARK-12624
> URL: https://issues.apache.org/jira/browse/SPARK-12624
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Reynold Xin
>Priority: Blocker
>
> See https://github.com/apache/spark/pull/10564
> Basically that test case should pass without the above fix and just assume b 
> is null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12624) When schema is specified, we should give better error message if actual row length doesn't match

2016-01-23 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114123#comment-15114123
 ] 

Cheng Lian commented on SPARK-12624:


Quoted Davies' offline comment
{quote}
We always raise an exception in 1.4/1.5, and a different exception in 1.6/master 
(the show() of 1.6.0 is weird, don't know why).

We should raise a better exception when deserializing them in the JVM 
(EvaluatePython.fromJava).
{quote}

> When schema is specified, we should give better error message if actual row 
> length doesn't match
> 
>
> Key: SPARK-12624
> URL: https://issues.apache.org/jira/browse/SPARK-12624
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Reynold Xin
>Priority: Blocker
>
> See https://github.com/apache/spark/pull/10564
> Basically that test case should pass without the above fix and just assume b 
> is null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12624) When schema is specified, we should give better error message if actual row length doesn't match

2016-01-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12624:
---
Description: 
The following code snippet reproduces this issue:
{code}
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.types import Row

schema = StructType([StructField("a", IntegerType()), StructField("b", 
StringType())])
rdd = sc.parallelize(range(10)).map(lambda x: Row(a=x))
df = sqlContext.createDataFrame(rdd, schema)
df.show()
{code}
An unintuitive {{ArrayIndexOutOfBoundsException}} exception is thrown in this 
case:
{code}
...
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.genericGet(rows.scala:227)
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getAs(rows.scala:35)
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.isNullAt(rows.scala:36)
...
{code}
We should give a better error message here.

  was:
See https://github.com/apache/spark/pull/10564

Basically that test case should pass without the above fix and just assume b is 
null.



> When schema is specified, we should give better error message if actual row 
> length doesn't match
> 
>
> Key: SPARK-12624
> URL: https://issues.apache.org/jira/browse/SPARK-12624
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Reynold Xin
>Priority: Blocker
>
> The following code snippet reproduces this issue:
> {code}
> from pyspark.sql.types import StructType, StructField, IntegerType, StringType
> from pyspark.sql.types import Row
> schema = StructType([StructField("a", IntegerType()), StructField("b", 
> StringType())])
> rdd = sc.parallelize(range(10)).map(lambda x: Row(a=x))
> df = sqlContext.createDataFrame(rdd, schema)
> df.show()
> {code}
> An unintuitive {{ArrayIndexOutOfBoundsException}} exception is thrown in this 
> case:
> {code}
> ...
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
> at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.genericGet(rows.scala:227)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getAs(rows.scala:35)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.isNullAt(rows.scala:36)
> ...
> {code}
> We should give a better error message here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-12818) Implement Bloom filter and count-min sketch in DataFrames

2016-01-20 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12818:
---
Comment: was deleted

(was: User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/10851)

> Implement Bloom filter and count-min sketch in DataFrames
> -
>
> Key: SPARK-12818
> URL: https://issues.apache.org/jira/browse/SPARK-12818
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: BloomFilterandCount-MinSketchinSpark2.0.pdf
>
>
> This ticket tracks implementing Bloom filter and count-min sketch support in 
> DataFrames. Please see the attached design doc for more information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12938) Bloom filter DataFrame API integration

2016-01-20 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12938:
---
Assignee: Wenchen Fan  (was: Cheng Lian)

> Bloom filter DataFrame API integration
> --
>
> Key: SPARK-12938
> URL: https://issues.apache.org/jira/browse/SPARK-12938
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>    Reporter: Cheng Lian
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12936) Initial bloom filter implementation

2016-01-20 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12936:
---
Assignee: Wenchen Fan  (was: Cheng Lian)

> Initial bloom filter implementation
> ---
>
> Key: SPARK-12936
> URL: https://issues.apache.org/jira/browse/SPARK-12936
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>    Reporter: Cheng Lian
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12935) Count-min sketch DataFrame API integration

2016-01-20 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12935:
--

 Summary: Count-min sketch DataFrame API integration
 Key: SPARK-12935
 URL: https://issues.apache.org/jira/browse/SPARK-12935
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12933) Initial count-min sketch implementation

2016-01-20 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12933:
--

 Summary: Initial count-min sketch implementation
 Key: SPARK-12933
 URL: https://issues.apache.org/jira/browse/SPARK-12933
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12934) Count-min sketch serialization

2016-01-20 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12934:
--

 Summary: Count-min sketch serialization
 Key: SPARK-12934
 URL: https://issues.apache.org/jira/browse/SPARK-12934
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12938) Bloom filter DataFrame API integration

2016-01-20 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12938:
--

 Summary: Bloom filter DataFrame API integration
 Key: SPARK-12938
 URL: https://issues.apache.org/jira/browse/SPARK-12938
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12937) Bloom filter serialization

2016-01-20 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12937:
--

 Summary: Bloom filter serialization
 Key: SPARK-12937
 URL: https://issues.apache.org/jira/browse/SPARK-12937
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12936) Initial bloom filter implementation

2016-01-20 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12936:
--

 Summary: Initial bloom filter implementation
 Key: SPARK-12936
 URL: https://issues.apache.org/jira/browse/SPARK-12936
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12937) Bloom filter serialization

2016-01-20 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12937:
---
Assignee: Wenchen Fan  (was: Cheng Lian)

> Bloom filter serialization
> --
>
> Key: SPARK-12937
> URL: https://issues.apache.org/jira/browse/SPARK-12937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>    Reporter: Cheng Lian
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12560) SqlTestUtils.stripSparkFilter needs to copy utf8strings

2016-01-19 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-12560.

Resolution: Fixed
  Assignee: Imran Rashid

Resolved by https://github.com/apache/spark/pull/10510

> SqlTestUtils.stripSparkFilter needs to copy utf8strings
> ---
>
> Key: SPARK-12560
> URL: https://issues.apache.org/jira/browse/SPARK-12560
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
>
> {{SqlTestUtils.stripSparkFilter}} needs to make copies of the UTF8Strings, 
> eg., with {{FromUnsafeProjection}} to avoid returning duplicates of the same 
> row (see SPARK-9459).
> Right now, this isn't causing any problems, since the parquet string 
> predicate pushdown is turned off (see SPARK-11153).  However I ran into this 
> while trying to get the predicate pushdown to work with a different version 
> of parquet.  Without this fix, there were errors like:
> {noformat}
> [info]   !== Correct Answer - 4 ==   == Spark Answer - 4 ==
> [info]   ![1][2]
> [info][2][2]
> [info]   ![3][4]
> [info][4][4] (QueryTest.scala:127)
> {noformat}
> I figure it's worth making this change now that I've run into it.  PR coming 
> shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12560) SqlTestUtils.stripSparkFilter needs to copy utf8strings

2016-01-19 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12560:
---
Fix Version/s: 2.0.0

> SqlTestUtils.stripSparkFilter needs to copy utf8strings
> ---
>
> Key: SPARK-12560
> URL: https://issues.apache.org/jira/browse/SPARK-12560
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
> Fix For: 2.0.0
>
>
> {{SqlTestUtils.stripSparkFilter}} needs to make copies of the UTF8Strings, 
> eg., with {{FromUnsafeProjection}} to avoid returning duplicates of the same 
> row (see SPARK-9459).
> Right now, this isn't causing any problems, since the parquet string 
> predicate pushdown is turned off (see SPARK-11153).  However I ran into this 
> while trying to get the predicate pushdown to work with a different version 
> of parquet.  Without this fix, there were errors like:
> {noformat}
> [info]   !== Correct Answer - 4 ==   == Spark Answer - 4 ==
> [info]   ![1][2]
> [info][2][2]
> [info]   ![3][4]
> [info][4][4] (QueryTest.scala:127)
> {noformat}
> I figure it's worth making this change now that I've run into it.  PR coming 
> shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12867) Nullability of Intersect can be stricter

2016-01-19 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-12867.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10812
[https://github.com/apache/spark/pull/10812]

> Nullability of Intersect can be stricter
> 
>
> Key: SPARK-12867
> URL: https://issues.apache.org/jira/browse/SPARK-12867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>    Reporter: Cheng Lian
>Assignee: Xiao Li
>Priority: Minor
> Fix For: 2.0.0
>
>
> {{Intersect}} doesn't override {{SetOperation.output}}, which is defined as:
> {code}
>   override def output: Seq[Attribute] =
> left.output.zip(right.output).map { case (leftAttr, rightAttr) =>
>   leftAttr.withNullability(leftAttr.nullable || rightAttr.nullable)
> }
> {code}
> However, we can replace the {{||}} with {{&&}} for intersection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12867) Nullability of Intersect can be stricter

2016-01-18 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105616#comment-15105616
 ] 

Cheng Lian commented on SPARK-12867:


Thanks for helping! Go ahead, please. I'm assigning this to you.

> Nullability of Intersect can be stricter
> 
>
> Key: SPARK-12867
> URL: https://issues.apache.org/jira/browse/SPARK-12867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>    Reporter: Cheng Lian
>Priority: Minor
>
> {{Intersect}} doesn't override {{SetOperation.output}}, which is defined as:
> {code}
>   override def output: Seq[Attribute] =
> left.output.zip(right.output).map { case (leftAttr, rightAttr) =>
>   leftAttr.withNullability(leftAttr.nullable || rightAttr.nullable)
> }
> {code}
> However, we can replace the {{||}} with {{&&}} for intersection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12867) Nullability of Intersect can be stricter

2016-01-18 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12867:
---
Assignee: Xiao Li

> Nullability of Intersect can be stricter
> 
>
> Key: SPARK-12867
> URL: https://issues.apache.org/jira/browse/SPARK-12867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>    Reporter: Cheng Lian
>Assignee: Xiao Li
>Priority: Minor
>
> {{Intersect}} doesn't override {{SetOperation.output}}, which is defined as:
> {code}
>   override def output: Seq[Attribute] =
> left.output.zip(right.output).map { case (leftAttr, rightAttr) =>
>   leftAttr.withNullability(leftAttr.nullable || rightAttr.nullable)
> }
> {code}
> However, we can replace the {{||}} with {{&&}} for intersection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12867) Nullability of Intersect can be stricter

2016-01-17 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12867:
--

 Summary: Nullability of Intersect can be stricter
 Key: SPARK-12867
 URL: https://issues.apache.org/jira/browse/SPARK-12867
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Lian
Priority: Minor


{{Intersect}} doesn't override {{SetOperation.output}}, which is defined as:
{code}
  override def output: Seq[Attribute] =
left.output.zip(right.output).map { case (leftAttr, rightAttr) =>
  leftAttr.withNullability(leftAttr.nullable || rightAttr.nullable)
}
{code}
However, we can replace the {{||}} with {{&&}} for intersection.
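
For reference, a sketch of what the stricter override could look like (this is 
just the snippet above with {{&&}} substituted for {{||}}, not necessarily the 
final patch):
{code}
  override def output: Seq[Attribute] =
    left.output.zip(right.output).map { case (leftAttr, rightAttr) =>
      leftAttr.withNullability(leftAttr.nullable && rightAttr.nullable)
    }
{code}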



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: DataFrame partitionBy to a single Parquet file (per partition)

2016-01-15 Thread Cheng Lian
You may try DataFrame.repartition(partitionExprs: Column*) to shuffle 
all data belonging to a single (data) partition into a single (RDD) 
partition:


  // $"..." columns require: import sqlContext.implicits._
  df.coalesce(1)
    .repartition($"entity", $"year", $"month", $"day", $"status")
    .write.partitionBy("entity", "year", "month", "day", "status")
    .mode(SaveMode.Append)
    .parquet(s"$location")


(Unfortunately the naming here can be quite confusing.)

Cheng

On 1/14/16 11:48 PM, Patrick McGloin wrote:

Hi,

I would like to repartition / coalesce my data so that it is saved into 
one Parquet file per partition. I would also like to use the Spark SQL 
partitionBy API. So I could do that like this:


  df.coalesce(1).write.partitionBy("entity", "year", "month", "day", 
    "status").mode(SaveMode.Append).parquet(s"$location")


I've tested this and it doesn't seem to perform well. This is because 
there is only one partition to work on in the dataset and all the 
partitioning, compression and saving of files has to be done by one 
CPU core.


I could rewrite this to do the partitioning manually (using filter 
with the distinct partition values for example) before calling coalesce.


But is there a better way to do this using the standard Spark SQL API?

Best regards,

Patrick






[jira] [Commented] (SPARK-12403) "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore

2016-01-12 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15094471#comment-15094471
 ] 

Cheng Lian commented on SPARK-12403:


Also, could you please provide the exact version number of the Simba ODBC 
driver (e.g. something like 1.0.8.1006)?

> "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore
> 
>
> Key: SPARK-12403
> URL: https://issues.apache.org/jira/browse/SPARK-12403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
> Environment: ODBC connector query 
>Reporter: Lunen
>
> We are unable to query the SPARK tables using the ODBC driver from Simba 
> Spark(Databricks - "Simba Spark ODBC Driver 1.0")  We are able to do a show 
> databases and show tables, but not any queries. eg.
> Working:
> Select * from openquery(SPARK,'SHOW DATABASES')
> Select * from openquery(SPARK,'SHOW TABLES')
> Not working:
> Select * from openquery(SPARK,'Select * from lunentest')
> The error I get is:
> OLE DB provider "MSDASQL" for linked server "SPARK" returned message 
> "[Simba][SQLEngine] (31740) Table or view not found: spark..lunentest".
> Msg 7321, Level 16, State 2, Line 2
> An error occurred while preparing the query "Select * from lunentest" for 
> execution against OLE DB provider "MSDASQL" for linked server "SPARK"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12724) SQL generation support for persisted data source relations

2016-01-12 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-12724:
--

Assignee: Cheng Lian

> SQL generation support for persisted data source relations
> --
>
> Key: SPARK-12724
> URL: https://issues.apache.org/jira/browse/SPARK-12724
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>    Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12724) SQL generation support for persisted data source relations

2016-01-12 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-12724.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10712
[https://github.com/apache/spark/pull/10712]

> SQL generation support for persisted data source relations
> --
>
> Key: SPARK-12724
> URL: https://issues.apache.org/jira/browse/SPARK-12724
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>    Reporter: Cheng Lian
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12403) "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore

2016-01-12 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15094466#comment-15094466
 ] 

Cheng Lian commented on SPARK-12403:


Hi [~lunendl], I wonder what kind of Hive metastore you were using: embedded 
mode, local mode, or remote mode? I suspect this issue is related to 
SPARK-11783 and/or SPARK-9686.

> "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore
> 
>
> Key: SPARK-12403
> URL: https://issues.apache.org/jira/browse/SPARK-12403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
> Environment: ODBC connector query 
>Reporter: Lunen
>
> We are unable to query the SPARK tables using the ODBC driver from Simba 
> Spark(Databricks - "Simba Spark ODBC Driver 1.0")  We are able to do a show 
> databases and show tables, but not any queries. eg.
> Working:
> Select * from openquery(SPARK,'SHOW DATABASES')
> Select * from openquery(SPARK,'SHOW TABLES')
> Not working:
> Select * from openquery(SPARK,'Select * from lunentest')
> The error I get is:
> OLE DB provider "MSDASQL" for linked server "SPARK" returned message 
> "[Simba][SQLEngine] (31740) Table or view not found: spark..lunentest".
> Msg 7321, Level 16, State 2, Line 2
> An error occurred while preparing the query "Select * from lunentest" for 
> execution against OLE DB provider "MSDASQL" for linked server "SPARK"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: parquet repartitions and parquet.enable.summary-metadata does not work

2016-01-12 Thread Cheng Lian

I see. So there are actually 3000 tasks instead of 3000 jobs, right?

Would you mind providing the full stack trace of the GC issue? At first 
I thought it was identical to the _metadata one in the mail thread you 
mentioned.


Cheng

On 1/11/16 5:30 PM, Gavin Yue wrote:
Here is how I set the conf: 
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")


This actually works, I do not see the _metadata file anymore.

I think I made a mistake.  The 3000 jobs are coming from 
repartition("id").


I have 7600 json files and want to save as parquet.

So if I use df.write.parquet(path), it would generate 7600 parquet 
files with 7600 partitions, which has no problem.


But if I use repartition to change the partition number, say: 
df.repartition(3000).write.parquet


This would generate 7600 + 3000 tasks. The 3000 tasks always fail due to 
a GC problem.


Best,
Gavin



On Mon, Jan 11, 2016 at 4:31 PM, Cheng Lian <lian.cs@gmail.com> wrote:


Hey Gavin,

Could you please provide a snippet of your code to show how
you disabled "parquet.enable.summary-metadata" and wrote the
files? In particular, you mentioned that you saw "3000 jobs" fail. Were
you writing each Parquet file with an individual job? (Usually
people use write.partitionBy(...).parquet(...) to write multiple
Parquet files.)

Cheng


On 1/10/16 10:12 PM, Gavin Yue wrote:

Hey,

I am trying to convert a bunch of json files into parquet,
which would output over 7000 parquet files. But there are too
many files, so I want to repartition based on id to 3000.

But I got a GC error like this one:

https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAB4bC7_LR2rpHceQw3vyJ=l6xq9+9sjl3wgiispzyfh2xmt...@mail.gmail.com%3E#archives

So I set parquet.enable.summary-metadata to false. But when I
write.parquet, I could still see the 3000 jobs run after
writing the parquet files, and they failed due to GC.

Basically repartition never succeeded for me. Are there any
other settings which could be optimized?

Thanks,
Gavin







Re: parquet repartitions and parquet.enable.summary-metadata does not work

2016-01-11 Thread Cheng Lian

Hey Gavin,

Could you please provide a snippet of your code to show how you 
disabled "parquet.enable.summary-metadata" and wrote the files? 
In particular, you mentioned that you saw "3000 jobs" fail. Were you writing 
each Parquet file with an individual job? (Usually people use 
write.partitionBy(...).parquet(...) to write multiple Parquet files.)


Cheng

On 1/10/16 10:12 PM, Gavin Yue wrote:

Hey,

I am trying to convert a bunch of json files into parquet, which would 
output over 7000 parquet files. But there are too many files, so I 
want to repartition based on id to 3000.


But I got a GC error like this one: 
https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAB4bC7_LR2rpHceQw3vyJ=l6xq9+9sjl3wgiispzyfh2xmt...@mail.gmail.com%3E#archives


So I set parquet.enable.summary-metadata to false. But when I 
write.parquet, I could still see the 3000 jobs run after writing 
the parquet files, and they failed due to GC.


Basically repartition never succeeded for me. Are there any other 
settings which could be optimized?


Thanks,
Gavin



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[jira] [Updated] (SPARK-12742) org.apache.spark.sql.hive.LogicalPlanToSQLSuite failure due to Table already exists

2016-01-11 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12742:
---
Assignee: Fei Wang

> org.apache.spark.sql.hive.LogicalPlanToSQLSuite failure due to Table already 
> exists
> ---
>
> Key: SPARK-12742
> URL: https://issues.apache.org/jira/browse/SPARK-12742
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Fei Wang
>Assignee: Fei Wang
> Fix For: 2.0.0
>
>
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.sql.hive.LogicalPlanToSQLSuite *** ABORTED *** (325 
> milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: Table `t1` already exists.;
> [info]   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:296)
> [info]   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:285)
> [info]   at 
> org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:33)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
> [info]   at 
> org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:23)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
> [info]   at 
> org.apache.spark.sql.hive.LogicalPlanToSQLSuite.run(LogicalPlanToSQLSuite.scala:23)
> [info]   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
> [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
> [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12742) org.apache.spark.sql.hive.LogicalPlanToSQLSuite failure due to Table already exists

2016-01-11 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-12742.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10682
[https://github.com/apache/spark/pull/10682]

> org.apache.spark.sql.hive.LogicalPlanToSQLSuite failure due to Table already 
> exists
> ---
>
> Key: SPARK-12742
> URL: https://issues.apache.org/jira/browse/SPARK-12742
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Fei Wang
> Fix For: 2.0.0
>
>
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.sql.hive.LogicalPlanToSQLSuite *** ABORTED *** (325 
> milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: Table `t1` already exists.;
> [info]   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:296)
> [info]   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:285)
> [info]   at 
> org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:33)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
> [info]   at 
> org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:23)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
> [info]   at 
> org.apache.spark.sql.hive.LogicalPlanToSQLSuite.run(LogicalPlanToSQLSuite.scala:23)
> [info]   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
> [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
> [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12725) SQL generation suffers from name conflicts introduced by some analysis rules

2016-01-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12725:
--

 Summary: SQL generation suffers from name conflicts introduced by 
some analysis rules
 Key: SPARK-12725
 URL: https://issues.apache.org/jira/browse/SPARK-12725
 Project: Spark
  Issue Type: Sub-task
Reporter: Cheng Lian


Some analysis rules generate auxiliary attribute references with the same name 
but different expression IDs. For example, {{ResolveAggregateFunctions}} 
introduces {{havingCondition}} and {{aggOrder}}, and 
{{DistinctAggregationRewriter}} introduces {{gid}}.

This is OK for normal query execution since these attribute references get 
expression IDs. However, it's troublesome when converting resolved query plans 
back to SQL query strings since expression IDs are erased.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12723) Comprehensive SQL generation support for expressions

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12723:
---
Description: 
Ensure that every built-in expression can be mapped to its SQL representation if 
there is one (e.g. ScalaUDF doesn't have a SQL representation).

A (possibly incomplete) list of unsupported expressions is provided in PR 
description of [PR #10541|https://github.com/apache/spark/pull/10541]:

- Math expressions
- String expressions
- Null expressions
- Calendar interval literal
- Part of date time expressions
- Complex type creators
- Special NOT expressions, e.g. NOT LIKE and NOT IN

  was:
Ensure that all built-in expressions can be mapped to its SQL representation if 
there is one (e.g. ScalaUDF doesn't have a SQL representation).

A (possibly incomplete) list of unsupported expressions is provided in PR 
description of [PR #10541|https://github.com/apache/spark/pull/10541]:


> Comprehensive SQL generation support for expressions
> 
>
> Key: SPARK-12723
> URL: https://issues.apache.org/jira/browse/SPARK-12723
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Cheng Lian
>
> Ensure that every built-in expression can be mapped to its SQL representation 
> if there is one (e.g. ScalaUDF doesn't have a SQL representation).
> A (possibly incomplete) list of unsupported expressions is provided in PR 
> description of [PR #10541|https://github.com/apache/spark/pull/10541]:
> - Math expressions
> - String expressions
> - Null expressions
> - Calendar interval literal
> - Part of date time expressions
> - Complex type creators
> - Special NOT expressions, e.g. NOT LIKE and NOT IN



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12725) SQL generation suffers from name conflicts introduced by some analysis rules

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12725:
---
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0

> SQL generation suffers from name conflicts introduced by some analysis rules
> ---
>
> Key: SPARK-12725
> URL: https://issues.apache.org/jira/browse/SPARK-12725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>    Reporter: Cheng Lian
>
> Some analysis rules generate auxiliary attribute references with the same 
> name but different expression IDs. For example, {{ResolveAggregateFunctions}} 
> introduces {{havingCondition}} and {{aggOrder}}, and 
> {{DistinctAggregationRewriter}} introduces {{gid}}.
> This is OK for normal query execution since these attribute references get 
> expression IDs. However, it's troublesome when converting resolved query 
> plans back to SQL query strings since expression IDs are erased.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12728) Integrate SQL generation feature with native view

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12728:
---
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0

> Integrate SQL generation feature with native view
> -
>
> Key: SPARK-12728
> URL: https://issues.apache.org/jira/browse/SPARK-12728
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>    Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12727) SQL generation support for distinct aggregation patterns that fit DistinctAggregationRewriter analysis rule

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12727:
---
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0

> SQL generation support for distinct aggregation patterns that fit 
> DistinctAggregationRewriter analysis rule
> ---
>
> Key: SPARK-12727
> URL: https://issues.apache.org/jira/browse/SPARK-12727
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12720) SQL generation support for cube, rollup, and grouping set

2016-01-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12720:
--

 Summary: SQL generation support for cube, rollup, and grouping set
 Key: SPARK-12720
 URL: https://issues.apache.org/jira/browse/SPARK-12720
 Project: Spark
  Issue Type: Sub-task
Reporter: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12721) SQL generation support for script transformation

2016-01-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12721:
--

 Summary: SQL generation support for script transformation
 Key: SPARK-12721
 URL: https://issues.apache.org/jira/browse/SPARK-12721
 Project: Spark
  Issue Type: Sub-task
Reporter: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11012) Canonicalize view definitions

2016-01-08 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090110#comment-15090110
 ] 

Cheng Lian commented on SPARK-11012:


Done.

> Canonicalize view definitions
> -
>
> Key: SPARK-11012
> URL: https://issues.apache.org/jira/browse/SPARK-11012
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> In SPARK-10337, we added the first step of supporting view natively. Building 
> on top of that work, we need to canonicalize the view definition. So, for a 
> SQL string SELECT a, b FROM table, we will save this text to Hive metastore 
> as SELECT `table`.`a`, `table`.`b` FROM `currentDB`.`table`. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11012) Canonicalize view definitions

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-11012:
---
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0

> Canonicalize view definitions
> -
>
> Key: SPARK-11012
> URL: https://issues.apache.org/jira/browse/SPARK-11012
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>
> In SPARK-10337, we added the first step of supporting view natively. Building 
> on top of that work, we need to canonicalize the view definition. So, for a 
> SQL string SELECT a, b FROM table, we will save this text to Hive metastore 
> as SELECT `table`.`a`, `table`.`b` FROM `currentDB`.`table`. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12728) Integrate SQL generation feature with native view

2016-01-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12728:
--

 Summary: Integrate SQL generation feature with native view
 Key: SPARK-12728
 URL: https://issues.apache.org/jira/browse/SPARK-12728
 Project: Spark
  Issue Type: Sub-task
Reporter: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12718) SQL generation support for window functions

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12718:
---
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0

> SQL generation support for window functions
> ---
>
> Key: SPARK-12718
> URL: https://issues.apache.org/jira/browse/SPARK-12718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>    Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12720) SQL generation support for cube, rollup, and grouping set

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12720:
---
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0

> SQL generation support for cube, rollup, and grouping set
> -
>
> Key: SPARK-12720
> URL: https://issues.apache.org/jira/browse/SPARK-12720
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>    Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12719) SQL generation support for generators (including UDTF)

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12719:
---
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0

> SQL generation support for generators (including UDTF)
> --
>
> Key: SPARK-12719
> URL: https://issues.apache.org/jira/browse/SPARK-12719
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>    Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12726) ParquetConversions doesn't always propagate metastore table identifier to ParquetRelation

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12726:
---
Description: (I hit this issue while working on SPARK-12593, but haven't 
got time to investigate it. Will fill more details when I get some clue.)  
(was: (I hit this issue while working on 12573, but haven't got time to 
investigate it. Will fill more details when I get some clue.))

> ParquetConversions doesn't always propagate metastore table identifier to 
> ParquetRelation
> -
>
> Key: SPARK-12726
> URL: https://issues.apache.org/jira/browse/SPARK-12726
> Project: Spark
>  Issue Type: Bug
>    Reporter: Cheng Lian
>
> (I hit this issue while working on SPARK-12593, but haven't got time to 
> investigate it. Will fill more details when I get some clue.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12723) Comprehensive SQL generation support for expressions

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12723:
---
Description: 
Ensure that every built-in expression can be mapped to its SQL representation 
when one exists (e.g. ScalaUDF doesn't have a SQL representation).

A (possibly incomplete) list of unsupported expressions is provided in the 
description of [PR #10541|https://github.com/apache/spark/pull/10541]:

  was:
Ensure that all built-in expressions can be mapped to its SQL representation if 
there is one (e.g. ScalaUDF doesn't have a SQL representation).

A (possibly incomplete) list of unsupported expressions is provided in PR 
description of 


> Comprehensive SQL generation support for expressions
> 
>
> Key: SPARK-12723
> URL: https://issues.apache.org/jira/browse/SPARK-12723
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Cheng Lian
>
> Ensure that every built-in expression can be mapped to its SQL representation 
> when one exists (e.g. ScalaUDF doesn't have a SQL representation).
> A (possibly incomplete) list of unsupported expressions is provided in the 
> description of [PR #10541|https://github.com/apache/spark/pull/10541]:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12723) Comprehensive SQL generation support for expressions

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12723:
---
Description: 
Ensure that every built-in expression can be mapped to its SQL representation 
when one exists (e.g. ScalaUDF doesn't have a SQL representation).

A (possibly incomplete) list of unsupported expressions is provided in the 
description of 

  was:Ensure that all built-in expressions can be mapped to its SQL 
representation if there is one (e.g. ScalaUDF doesn't have a SQL 
representation).


> Comprehensive SQL generation support for expressions
> 
>
> Key: SPARK-12723
> URL: https://issues.apache.org/jira/browse/SPARK-12723
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Cheng Lian
>
> Ensure that every built-in expression can be mapped to its SQL representation 
> when one exists (e.g. ScalaUDF doesn't have a SQL representation).
> A (possibly incomplete) list of unsupported expressions is provided in the 
> description of 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12727) SQL generation support for distinct aggregation patterns that fit DistinctAggregationRewriter analysis rule

2016-01-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12727:
--

 Summary: SQL generation support for distinct aggregation patterns 
that fit DistinctAggregationRewriter analysis rule
 Key: SPARK-12727
 URL: https://issues.apache.org/jira/browse/SPARK-12727
 Project: Spark
  Issue Type: Sub-task
Reporter: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12718) SQL generation support for window functions

2016-01-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12718:
--

 Summary: SQL generation support for window functions
 Key: SPARK-12718
 URL: https://issues.apache.org/jira/browse/SPARK-12718
 Project: Spark
  Issue Type: Sub-task
Reporter: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12593) Convert basic resolved logical plans back to SQL query strings

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12593:
---
Summary: Convert basic resolved logical plans back to SQL query strings  
(was: Convert resolved logical plans back to SQL query strings)

> Convert basic resolved logical plans back to SQL query strings
> --
>
> Key: SPARK-12593
> URL: https://issues.apache.org/jira/browse/SPARK-12593
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Cheng Lian
>    Assignee: Cheng Lian
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12719) SQL generation support for generators (including UDTF)

2016-01-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12719:
--

 Summary: SQL generation support for generators (including UDTF)
 Key: SPARK-12719
 URL: https://issues.apache.org/jira/browse/SPARK-12719
 Project: Spark
  Issue Type: Sub-task
Reporter: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12724) SQL generation support for persisted data source relations

2016-01-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12724:
--

 Summary: SQL generation support for persisted data source relations
 Key: SPARK-12724
 URL: https://issues.apache.org/jira/browse/SPARK-12724
 Project: Spark
  Issue Type: Sub-task
Reporter: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12723) Comprehensive SQL generation support for expressions

2016-01-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12723:
--

 Summary: Comprehensive SQL generation support for expressions
 Key: SPARK-12723
 URL: https://issues.apache.org/jira/browse/SPARK-12723
 Project: Spark
  Issue Type: Sub-task
Reporter: Cheng Lian


Ensure that every built-in expression can be mapped to its SQL representation 
when one exists (e.g. ScalaUDF doesn't have a SQL representation).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12726) ParquetConversions doesn't always propagate metastore table identifier to ParquetRelation

2016-01-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12726:
--

 Summary: ParquetConversions doesn't always propagate metastore 
table identifier to ParquetRelation
 Key: SPARK-12726
 URL: https://issues.apache.org/jira/browse/SPARK-12726
 Project: Spark
  Issue Type: Bug
Reporter: Cheng Lian


(I hit this issue while working on 12573, but haven't got time to investigate 
it. Will fill more details when I get some clue.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12721) SQL generation support for script transformation

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12721:
---
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0

> SQL generation support for script transformation
> 
>
> Key: SPARK-12721
> URL: https://issues.apache.org/jira/browse/SPARK-12721
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>    Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12723) Comprehensive SQL generation support for expressions

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12723:
---
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0

> Comprehensive SQL generation support for expressions
> 
>
> Key: SPARK-12723
> URL: https://issues.apache.org/jira/browse/SPARK-12723
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>    Reporter: Cheng Lian
>
> Ensure that every built-in expression can be mapped to its SQL representation 
> when one exists (e.g. ScalaUDF doesn't have a SQL representation).
> A (possibly incomplete) list of unsupported expressions is provided in the 
> description of [PR #10541|https://github.com/apache/spark/pull/10541]:
> - Math expressions
> - String expressions
> - Null expressions
> - Calendar interval literal
> - Part of date time expressions
> - Complex type creators
> - Special NOT expressions, e.g. NOT LIKE and NOT IN



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12724) SQL generation support for persisted data source relations

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12724:
---
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0

> SQL generation support for persisted data source relations
> --
>
> Key: SPARK-12724
> URL: https://issues.apache.org/jira/browse/SPARK-12724
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>    Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12593) Convert resolved logical plans back to SQL query strings

2015-12-31 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12593:
--

 Summary: Convert resolved logical plans back to SQL query strings
 Key: SPARK-12593
 URL: https://issues.apache.org/jira/browse/SPARK-12593
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12592) TestHive.reset hides Spark testing logs

2015-12-31 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12592:
--

 Summary: TestHive.reset hides Spark testing logs
 Key: SPARK-12592
 URL: https://issues.apache.org/jira/browse/SPARK-12592
 Project: Spark
  Issue Type: Test
  Components: Tests
Reporter: Cheng Lian


There's a hack in {{TestHive.reset()}} that is intended to mute noisy Hive 
loggers. However, it also mutes Spark's own testing loggers.
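
A hedged sketch of the alternative this suggests, assuming log4j 1.x (the logger 
names below are illustrative, not copied from {{TestHive}}): mute Hive loggers by 
package prefix instead of raising the global log level, so Spark's own test 
logging stays visible.

{code}
import org.apache.log4j.{Level, Logger}

// Quiet only the noisy Hive-related packages...
Seq("org.apache.hadoop.hive", "hive.ql.metadata", "org.apache.hadoop.mapred")
  .foreach(name => Logger.getLogger(name).setLevel(Level.WARN))

// ...while Spark's own loggers keep whatever level the test configuration set.
println(Logger.getLogger("org.apache.spark").getEffectiveLevel)
{code}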



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5948) Support writing to partitioned table for the Parquet data source

2015-12-28 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073596#comment-15073596
 ] 

Cheng Lian commented on SPARK-5948:
---

It's the HadoopFsRelation-based Parquet data source. HadoopFsRelation supports 
both reading and writing partitioned tables.

> Support writing to partitioned table for the Parquet data source
> 
>
> Key: SPARK-5948
> URL: https://issues.apache.org/jira/browse/SPARK-5948
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>    Reporter: Cheng Lian
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.4.0
>
>
> In 1.3.0, we added support for reading partitioned tables declared in the Hive 
> metastore for the Parquet data source. However, writing to partitioned tables 
> is not supported yet. This feature should probably be built upon SPARK-5947.
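
For context, a hedged example of the partitioned-write path this ticket 
describes, using the public {{DataFrameWriter}} API that eventually shipped; the 
schema and output path below are illustrative assumptions:

{code}
case class Event(id: Long, country: String, value: Double)

val events = sqlContext.createDataFrame(Seq(
  Event(1L, "US", 1.0),
  Event(2L, "DE", 2.0)))

// Writes one sub-directory per distinct `country` value, e.g. .../country=US/.
events.write
  .partitionBy("country")
  .parquet("/tmp/events_partitioned")

// Reading the directory back rediscovers `country` as a partition column.
sqlContext.read.parquet("/tmp/events_partitioned").printSchema()
{code}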



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

2015-12-26 Thread Cheng Lian

+1

On 12/23/15 12:39 PM, Yin Huai wrote:

+1

On Tue, Dec 22, 2015 at 8:10 PM, Denny Lee wrote:


+1

On Tue, Dec 22, 2015 at 7:05 PM Aaron Davidson wrote:

+1

On Tue, Dec 22, 2015 at 7:01 PM, Josh Rosen wrote:

+1

On Tue, Dec 22, 2015 at 7:00 PM, Jeff Zhang wrote:

+1

On Wed, Dec 23, 2015 at 7:36 AM, Mark Hamstra wrote:

+1

On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust wrote:

Please vote on releasing the following
candidate as Apache Spark version 1.6.0!

The vote is open until Friday, December 25,
2015 at 18:00 UTC and passes if a majority of
at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

The tag to be voted on is _v1.6.0-rc4 (4062cda3087ae42c6c3cb24508fc1d3a931accdf)_

The release files, including signatures,
digests, etc. can be found at:

http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-bin/



Release artifacts are signed with the
following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be
found at:

https://repository.apache.org/content/repositories/orgapachespark-1176/

The test repository (versioned as v1.6.0-rc4)
for this release can be found at:

https://repository.apache.org/content/repositories/orgapachespark-1175/

The documentation corresponding to this
release can be found at:

http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-docs/



===
== How can I help test this release? ==
===
If you are a Spark user, you can help us test
this release by taking an existing Spark
workload and running on this release
candidate, then reporting any regressions.


== What justifies a -1 vote for this release? ==

This vote is happening towards the end of the
1.6 QA period, so -1 votes should only occur
for significant regressions from 1.5. Bugs
already present in 1.5, minor regressions, or
bugs related to new features will not block
this release.


===
== What should happen to JIRA tickets still targeting 1.6.0? ==
===
1. It is OK for documentation patches to
target 1.6.0 and still go into branch-1.6,
since documentations will be published
separately from the release.
2. New features for non-alpha-modules should
target 1.7+.
3. Non-blocker bug fixes should target 1.6.1
or 1.7.0, or drop the target version.


==
== Major changes to help you focus your testing ==

[jira] [Created] (SPARK-12498) BooleanSimplification cleanup

2015-12-23 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12498:
--

 Summary: BooleanSimplification cleanup
 Key: SPARK-12498
 URL: https://issues.apache.org/jira/browse/SPARK-12498
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor


Scala syntax and our existing Catalyst expression DSL allow us to refactor 
{{BooleanSimplification}} into something substantially clearer and more readable.
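
For flavor, a minimal sketch in plain Scala (not the actual Catalyst rule or its 
DSL; the types below are illustrative) of the pattern-matching style such a 
cleanup tends toward:

{code}
sealed trait Pred
case object True extends Pred
case object False extends Pred
case class And(l: Pred, r: Pred) extends Pred
case class Or(l: Pred, r: Pred) extends Pred

// A single top-down simplification pass, written as one readable pattern match.
def simplify(p: Pred): Pred = p match {
  case And(True, r)                  => simplify(r)
  case And(l, True)                  => simplify(l)
  case And(False, _) | And(_, False) => False
  case Or(False, r)                  => simplify(r)
  case Or(l, False)                  => simplify(l)
  case Or(True, _) | Or(_, True)     => True
  case other                         => other
}

println(simplify(And(True, Or(False, True))))  // prints True
{code}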



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12478) Dataset fields of product types can't be null

2015-12-22 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069044#comment-15069044
 ] 

Cheng Lian commented on SPARK-12478:


I'm leaving this ticket open since we also need to backport this to branch-1.6 
after the release.

> Dataset fields of product types can't be null
> -
>
> Key: SPARK-12478
> URL: https://issues.apache.org/jira/browse/SPARK-12478
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>    Reporter: Cheng Lian
>Assignee: Apache Spark
>  Labels: backport-needed
>
> Spark shell snippet for reproduction:
> {code}
> import sqlContext.implicits._
> case class Inner(f: Int)
> case class Outer(i: Inner)
> Seq(Outer(null)).toDS().toDF().show()
> Seq(Outer(null)).toDS().show()
> {code}
> Expected output should be:
> {noformat}
> ++
> |   i|
> ++
> |null|
> ++
> ++
> |   i|
> ++
> |null|
> ++
> {noformat}
> Actual output:
> {noformat}
> +--+
> | i|
> +--+
> |[null]|
> +--+
> java.lang.RuntimeException: Error while decoding: java.lang.RuntimeException: 
> Null value appeared in non-nullable field Inner.f of type scala.Int. If the 
> schema is inferred from a Scala tuple/case class, or a Java bean, please try 
> to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
> instead of int/scala.Int).
> newinstance(class $iwC$$iwC$Outer,if (isnull(input[0, 
> StructType(StructField(f,IntegerType,false))])) null else newinstance(class 
> $iwC$$iwC$Inner,assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
>  $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)),false,ObjectType(class 
> $iwC$$iwC$Outer),Some($iwC$$iwC@6ab35ce3))
> +- if (isnull(input[0, StructType(StructField(f,IntegerType,false))])) null 
> else newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
>  $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0))
>:- isnull(input[0, StructType(StructField(f,IntegerType,false))])
>:  +- input[0, StructType(StructField(f,IntegerType,false))]
>:- null
>+- newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
>  $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0))
>   +- assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int)
>  +- input[0, StructType(StructField(f,IntegerType,false))].f
> +- input[0, StructType(StructField(f,IntegerType,false))]
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:224)
> at 
> org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704)
> at 
> org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at org.apache.spark.sql.Dataset.collect(Dataset.scala:704)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:725)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:240)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:230)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:193)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:201)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:40)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:42)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:44)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:46)
> at $iwC$$iwC$$iwC$$iwC.(:48)
> at $iwC$$iwC$$iwC.(:50)
> at $iwC$$iwC.(:52)
> at $iwC.(:54)
> at (:56)
> at .(:60)
> at .()
> at .(:7)
> at .()
> at $print()
> a
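
For reference, a hedged sketch of the workaround the error message itself 
suggests (this is not the fix tracked by the ticket, and the classes below are a 
hypothetical variant of the reproduction above): model fields that may be absent 
with {{Option}} or boxed Java types so the encoder treats them as nullable.

{code}
import sqlContext.implicits._

case class Inner(f: Option[Int])   // Option[Int] instead of a bare Int
case class Outer(i: Inner)

// With the inner field declared nullable, decoding no longer hits a null value
// in a non-nullable scala.Int slot.
Seq(Outer(Inner(None))).toDS().show()
{code}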

[jira] [Updated] (SPARK-12478) Dataset fields of product types can't be null

2015-12-22 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12478:
---
Labels: backport-needed  (was: )

> Dataset fields of product types can't be null
> -
>
> Key: SPARK-12478
> URL: https://issues.apache.org/jira/browse/SPARK-12478
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>    Reporter: Cheng Lian
>Assignee: Apache Spark
>  Labels: backport-needed
>
> Spark shell snippet for reproduction:
> {code}
> import sqlContext.implicits._
> case class Inner(f: Int)
> case class Outer(i: Inner)
> Seq(Outer(null)).toDS().toDF().show()
> Seq(Outer(null)).toDS().show()
> {code}
> Expected output should be:
> {noformat}
> ++
> |   i|
> ++
> |null|
> ++
> ++
> |   i|
> ++
> |null|
> ++
> {noformat}
> Actual output:
> {noformat}
> +--+
> | i|
> +--+
> |[null]|
> +--+
> java.lang.RuntimeException: Error while decoding: java.lang.RuntimeException: 
> Null value appeared in non-nullable field Inner.f of type scala.Int. If the 
> schema is inferred from a Scala tuple/case class, or a Java bean, please try 
> to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
> instead of int/scala.Int).
> newinstance(class $iwC$$iwC$Outer,if (isnull(input[0, 
> StructType(StructField(f,IntegerType,false))])) null else newinstance(class 
> $iwC$$iwC$Inner,assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
>  $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)),false,ObjectType(class 
> $iwC$$iwC$Outer),Some($iwC$$iwC@6ab35ce3))
> +- if (isnull(input[0, StructType(StructField(f,IntegerType,false))])) null 
> else newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
>  $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0))
>:- isnull(input[0, StructType(StructField(f,IntegerType,false))])
>:  +- input[0, StructType(StructField(f,IntegerType,false))]
>:- null
>+- newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
>  $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0))
>   +- assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int)
>  +- input[0, StructType(StructField(f,IntegerType,false))].f
> +- input[0, StructType(StructField(f,IntegerType,false))]
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:224)
> at 
> org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704)
> at 
> org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at org.apache.spark.sql.Dataset.collect(Dataset.scala:704)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:725)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:240)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:230)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:193)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:201)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:40)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:42)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:44)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:46)
> at $iwC$$iwC$$iwC$$iwC.(:48)
> at $iwC$$iwC$$iwC.(:50)
> at $iwC$$iwC.(:52)
> at $iwC.(:54)
> at (:56)
> at .(:60)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.ref

[jira] [Resolved] (SPARK-11164) Add InSet pushdown filter back for Parquet

2015-12-22 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-11164.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10278
[https://github.com/apache/spark/pull/10278]

> Add InSet pushdown filter back for Parquet
> --
>
> Key: SPARK-11164
> URL: https://issues.apache.org/jira/browse/SPARK-11164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11164) Add InSet pushdown filter back for Parquet

2015-12-22 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-11164:
---
Assignee: Xiao Li

> Add InSet pushdown filter back for Parquet
> --
>
> Key: SPARK-11164
> URL: https://issues.apache.org/jira/browse/SPARK-11164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Xiao Li
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12478) Dataset fields of product types can't be null

2015-12-22 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12478:
---
Summary: Dataset fields of product types can't be null  (was: Dataset 
fields whose types are case classes can't be null)

> Dataset fields of product types can't be null
> -
>
> Key: SPARK-12478
> URL: https://issues.apache.org/jira/browse/SPARK-12478
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>    Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Spark shell snippet for reproduction:
> {code}
> import sqlContext.implicits._
> case class Inner(f: Int)
> case class Outer(i: Inner)
> Seq(Outer(null)).toDS().toDF().show()
> Seq(Outer(null)).toDS().show()
> {code}
> Expected output should be:
> {noformat}
> ++
> |   i|
> ++
> |null|
> ++
> ++
> |   i|
> ++
> |null|
> ++
> {noformat}
> Actual output:
> {noformat}
> +--+
> | i|
> +--+
> |[null]|
> +--+
> java.lang.RuntimeException: Error while decoding: java.lang.RuntimeException: 
> Null value appeared in non-nullable field Inner.f of type scala.Int. If the 
> schema is inferred from a Scala tuple/case class, or a Java bean, please try 
> to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
> instead of int/scala.Int).
> newinstance(class $iwC$$iwC$Outer,if (isnull(input[0, 
> StructType(StructField(f,IntegerType,false))])) null else newinstance(class 
> $iwC$$iwC$Inner,assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
>  $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)),false,ObjectType(class 
> $iwC$$iwC$Outer),Some($iwC$$iwC@6ab35ce3))
> +- if (isnull(input[0, StructType(StructField(f,IntegerType,false))])) null 
> else newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
>  $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0))
>:- isnull(input[0, StructType(StructField(f,IntegerType,false))])
>:  +- input[0, StructType(StructField(f,IntegerType,false))]
>:- null
>+- newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
>  $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0))
>   +- assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int)
>  +- input[0, StructType(StructField(f,IntegerType,false))].f
> +- input[0, StructType(StructField(f,IntegerType,false))]
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:224)
> at 
> org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704)
> at 
> org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at org.apache.spark.sql.Dataset.collect(Dataset.scala:704)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:725)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:240)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:230)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:193)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:201)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:40)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:42)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:44)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:46)
> at $iwC$$iwC$$iwC$$iwC.(:48)
> at $iwC$$iwC$$iwC.(:50)
> at $iwC$$iwC.(:52)
> at $iwC.(:54)
> at (:56)
> at .(:60)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  

[jira] [Updated] (SPARK-12478) Dataset fields whose types are case classes can't be null

2015-12-22 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12478:
---
Summary: Dataset fields whose types are case classes can't be null  (was: 
Top level case class field of a Dataset can't be null)

> Dataset fields whose types are case classes can't be null
> 
>
> Key: SPARK-12478
> URL: https://issues.apache.org/jira/browse/SPARK-12478
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>    Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Spark shell snippet for reproduction:
> {code}
> import sqlContext.implicits._
> case class Inner(f: Int)
> case class Outer(i: Inner)
> Seq(Outer(null)).toDS().toDF().show()
> Seq(Outer(null)).toDS().show()
> {code}
> Expected output should be:
> {noformat}
> ++
> |   i|
> ++
> |null|
> ++
> ++
> |   i|
> ++
> |null|
> ++
> {noformat}
> Actual output:
> {noformat}
> +--+
> | i|
> +--+
> |[null]|
> +--+
> java.lang.RuntimeException: Error while decoding: java.lang.RuntimeException: 
> Null value appeared in non-nullable field Inner.f of type scala.Int. If the 
> schema is inferred from a Scala tuple/case class, or a Java bean, please try 
> to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
> instead of int/scala.Int).
> newinstance(class $iwC$$iwC$Outer,if (isnull(input[0, 
> StructType(StructField(f,IntegerType,false))])) null else newinstance(class 
> $iwC$$iwC$Inner,assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
>  $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)),false,ObjectType(class 
> $iwC$$iwC$Outer),Some($iwC$$iwC@6ab35ce3))
> +- if (isnull(input[0, StructType(StructField(f,IntegerType,false))])) null 
> else newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
>  $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0))
>:- isnull(input[0, StructType(StructField(f,IntegerType,false))])
>:  +- input[0, StructType(StructField(f,IntegerType,false))]
>:- null
>+- newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
>  $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0))
>   +- assertnotnull(input[0, 
> StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int)
>  +- input[0, StructType(StructField(f,IntegerType,false))].f
> +- input[0, StructType(StructField(f,IntegerType,false))]
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:224)
> at 
> org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704)
> at 
> org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at org.apache.spark.sql.Dataset.collect(Dataset.scala:704)
> at org.apache.spark.sql.Dataset.take(Dataset.scala:725)
> at org.apache.spark.sql.Dataset.showString(Dataset.scala:240)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:230)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:193)
> at org.apache.spark.sql.Dataset.show(Dataset.scala:201)
> at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:40)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:42)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:44)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:46)
> at $iwC$$iwC$$iwC$$iwC.(:48)
> at $iwC$$iwC$$iwC.(:50)
> at $iwC$$iwC.(:52)
> at $iwC.(:54)
> at (:56)
> at .(:60)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native

[jira] [Created] (SPARK-12478) Top level case class field of a Dataset can't be null

2015-12-22 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12478:
--

 Summary: Top level case class field of a Dataset can't be null
 Key: SPARK-12478
 URL: https://issues.apache.org/jira/browse/SPARK-12478
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0, 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian


Spark shell snippet for reproduction:

{code}
import sqlContext.implicits._

case class Inner(f: Int)
case class Outer(i: Inner)

Seq(Outer(null)).toDS().toDF().show()
Seq(Outer(null)).toDS().show()
{code}

Expected output should be:

{noformat}
++
|   i|
++
|null|
++

++
|   i|
++
|null|
++
{noformat}

Actual output:

{noformat}
+--+
| i|
+--+
|[null]|
+--+

java.lang.RuntimeException: Error while decoding: java.lang.RuntimeException: 
Null value appeared in non-nullable field Inner.f of type scala.Int. If the 
schema is inferred from a Scala tuple/case class, or a Java bean, please try to 
use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of 
int/scala.Int).
newinstance(class $iwC$$iwC$Outer,if (isnull(input[0, 
StructType(StructField(f,IntegerType,false))])) null else newinstance(class 
$iwC$$iwC$Inner,assertnotnull(input[0, 
StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
 $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)),false,ObjectType(class 
$iwC$$iwC$Outer),Some($iwC$$iwC@6ab35ce3))
+- if (isnull(input[0, StructType(StructField(f,IntegerType,false))])) null 
else newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, 
StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
 $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0))
   :- isnull(input[0, StructType(StructField(f,IntegerType,false))])
   :  +- input[0, StructType(StructField(f,IntegerType,false))]
   :- null
   +- newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, 
StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class
 $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0))
  +- assertnotnull(input[0, 
StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int)
 +- input[0, StructType(StructField(f,IntegerType,false))].f
+- input[0, StructType(StructField(f,IntegerType,false))]

at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:224)
at 
org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704)
at 
org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:704)
at org.apache.spark.sql.Dataset.take(Dataset.scala:725)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:240)
at org.apache.spark.sql.Dataset.show(Dataset.scala:230)
at org.apache.spark.sql.Dataset.show(Dataset.scala:193)
at org.apache.spark.sql.Dataset.show(Dataset.scala:201)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:40)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:42)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:44)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:46)
at $iwC$$iwC$$iwC$$iwC.(:48)
at $iwC$$iwC$$iwC.(:50)
at $iwC$$iwC.(:52)
at $iwC.(:54)
at (:56)
at .(:60)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1045)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1326)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:821)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:852)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:800)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857
