[jira] [Assigned] (SPARK-31869) BroadcastHashJoinExec's outputPartitioning can utilize the build side

2020-05-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31869:


Assignee: Apache Spark

> BroadcastHashJoinExec's outputPartitioning can utilize the build side
> 
>
> Key: SPARK-31869
> URL: https://issues.apache.org/jira/browse/SPARK-31869
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, the BroadcastHashJoinExec's outputPartitioning only uses the 
> streamed side's outputPartitioning. Thus, if a subsequent join's key comes from 
> the build side of a BroadcastHashJoinExec:
> {code:java}
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "500")
> val t1 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i1", "j1")
> val t2 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i2", "j2")
> val t3 = (0 until 20).map(i => (i % 7, i % 11)).toDF("i3", "j3")
> val t4 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i4", "j4")
> // join1 is a sort merge join.
> val join1 = t1.join(t2, t1("i1") === t2("i2"))
> // join2 is a broadcast join where t3 is broadcasted.
> val join2 = join1.join(t3, join1("i1") === t3("i3"))
> // Join on the column from the broadcasted side (i3).
> val join3 = join2.join(t4, join2("i3") === t4("i4"))
> join3.explain
> {code}
> it produces Exchange hashpartitioning(i3#29, 200):
> {code:java}
> == Physical Plan ==
> *(6) SortMergeJoin [i3#29], [i4#40], Inner
> :- *(4) Sort [i3#29 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(i3#29, 200), true, [id=#55]
> : +- *(3) BroadcastHashJoin [i1#7], [i3#29], Inner, BuildRight
> ::- *(3) SortMergeJoin [i1#7], [i2#18], Inner
> ::  :- *(1) Sort [i1#7 ASC NULLS FIRST], false, 0
> ::  :  +- Exchange hashpartitioning(i1#7, 200), true, [id=#28]
> ::  : +- LocalTableScan [i1#7, j1#8]
> ::  +- *(2) Sort [i2#18 ASC NULLS FIRST], false, 0
> :: +- Exchange hashpartitioning(i2#18, 200), true, [id=#29]
> ::+- LocalTableScan [i2#18, j2#19]
> :+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, 
> int, false] as bigint))), [id=#34]
> :   +- LocalTableScan [i3#29, j3#30]
> +- *(5) Sort [i4#40 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(i4#40, 200), true, [id=#39]
>   +- LocalTableScan [i4#40, j4#41]
> {code}
>  But, since BroadcastHashJoinExec handles only equi-joins, if the streamed side 
> has HashPartitioning on the join keys, its outputPartitioning can also cover the 
> equivalent build-side keys and eliminate that exchange.
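As a rough, self-contained sketch of the idea only (this is not Spark's internal Partitioning API; the names and types below are simplified for illustration): because the join is an equi-join, rows hash-partitioned on the streamed-side key i1 are by definition also hash-partitioned on the build-side key i3, so the join output could advertise both partitionings and the downstream Exchange on i3 would become unnecessary.

{code:scala}
// Minimal sketch only -- a toy model, not Spark's HashPartitioning/Partitioning classes.
case class HashPartitioning(keys: Seq[String], numPartitions: Int)

def expandedPartitionings(
    streamed: HashPartitioning,
    streamedKeys: Seq[String],
    buildKeys: Seq[String]): Seq[HashPartitioning] = {
  val equiv = streamedKeys.zip(buildKeys).toMap
  // Each partitioning key can stay as-is or be replaced by its equi-join counterpart.
  val choices = streamed.keys.map(k => Seq(k) ++ equiv.get(k))
  choices
    .foldLeft(Seq(Seq.empty[String])) { (acc, cs) =>
      for (prefix <- acc; c <- cs) yield prefix :+ c
    }
    .map(HashPartitioning(_, streamed.numPartitions))
}

// Stream side is hash-partitioned on i1 and the broadcast join condition is i1 = i3,
// so the join output can claim HashPartitioning(i1, 200) and HashPartitioning(i3, 200).
println(expandedPartitionings(HashPartitioning(Seq("i1"), 200), Seq("i1"), Seq("i3")))
{code}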






[jira] [Commented] (SPARK-31869) BroadcastHashJoinExec's outputPartitioning can utilize the build side

2020-05-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120107#comment-17120107
 ] 

Apache Spark commented on SPARK-31869:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/28676

> BroadcastHashJoinExec's outputPartitioning can utilize the build side
> 
>
> Key: SPARK-31869
> URL: https://issues.apache.org/jira/browse/SPARK-31869
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Minor
>
> Currently, the BroadcastHashJoinExec's outputPartitioning only uses the 
> streamed side's outputPartitioning. Thus, if a subsequent join's key comes from 
> the build side of a BroadcastHashJoinExec:
> {code:java}
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "500")
> val t1 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i1", "j1")
> val t2 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i2", "j2")
> val t3 = (0 until 20).map(i => (i % 7, i % 11)).toDF("i3", "j3")
> val t4 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i4", "j4")
> // join1 is a sort merge join.
> val join1 = t1.join(t2, t1("i1") === t2("i2"))
> // join2 is a broadcast join where t3 is broadcasted.
> val join2 = join1.join(t3, join1("i1") === t3("i3"))
> // Join on the column from the broadcasted side (i3).
> val join3 = join2.join(t4, join2("i3") === t4("i4"))
> join3.explain
> {code}
> it produces Exchange hashpartitioning(i3#29, 200):
> {code:java}
> == Physical Plan ==
> *(6) SortMergeJoin [i3#29], [i4#40], Inner
> :- *(4) Sort [i3#29 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(i3#29, 200), true, [id=#55]
> : +- *(3) BroadcastHashJoin [i1#7], [i3#29], Inner, BuildRight
> ::- *(3) SortMergeJoin [i1#7], [i2#18], Inner
> ::  :- *(1) Sort [i1#7 ASC NULLS FIRST], false, 0
> ::  :  +- Exchange hashpartitioning(i1#7, 200), true, [id=#28]
> ::  : +- LocalTableScan [i1#7, j1#8]
> ::  +- *(2) Sort [i2#18 ASC NULLS FIRST], false, 0
> :: +- Exchange hashpartitioning(i2#18, 200), true, [id=#29]
> ::+- LocalTableScan [i2#18, j2#19]
> :+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, 
> int, false] as bigint))), [id=#34]
> :   +- LocalTableScan [i3#29, j3#30]
> +- *(5) Sort [i4#40 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(i4#40, 200), true, [id=#39]
>   +- LocalTableScan [i4#40, j4#41]
> {code}
>  But, since BroadcastHashJoinExec handles only equi-joins, if the streamed side 
> has HashPartitioning on the join keys, its outputPartitioning can also cover the 
> equivalent build-side keys and eliminate that exchange.






[jira] [Assigned] (SPARK-31869) BroadcastHashJoinExec's outputPartitioning can utilize the build side

2020-05-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31869:


Assignee: (was: Apache Spark)

> BroadcastHashJoinExec's outputPartitioning can utilize the build side
> 
>
> Key: SPARK-31869
> URL: https://issues.apache.org/jira/browse/SPARK-31869
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Minor
>
> Currently, the BroadcastHashJoinExec's outputPartitioning only uses the 
> streamed side's outputPartitioning. Thus, if a subsequent join's key comes from 
> the build side of a BroadcastHashJoinExec:
> {code:java}
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "500")
> val t1 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i1", "j1")
> val t2 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i2", "j2")
> val t3 = (0 until 20).map(i => (i % 7, i % 11)).toDF("i3", "j3")
> val t4 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i4", "j4")
> // join1 is a sort merge join.
> val join1 = t1.join(t2, t1("i1") === t2("i2"))
> // join2 is a broadcast join where t3 is broadcasted.
> val join2 = join1.join(t3, join1("i1") === t3("i3"))
> // Join on the column from the broadcasted side (i3).
> val join3 = join2.join(t4, join2("i3") === t4("i4"))
> join3.explain
> {code}
> it produces Exchange hashpartitioning(i3#29, 200):
> {code:java}
> == Physical Plan ==
> *(6) SortMergeJoin [i3#29], [i4#40], Inner
> :- *(4) Sort [i3#29 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(i3#29, 200), true, [id=#55]
> : +- *(3) BroadcastHashJoin [i1#7], [i3#29], Inner, BuildRight
> ::- *(3) SortMergeJoin [i1#7], [i2#18], Inner
> ::  :- *(1) Sort [i1#7 ASC NULLS FIRST], false, 0
> ::  :  +- Exchange hashpartitioning(i1#7, 200), true, [id=#28]
> ::  : +- LocalTableScan [i1#7, j1#8]
> ::  +- *(2) Sort [i2#18 ASC NULLS FIRST], false, 0
> :: +- Exchange hashpartitioning(i2#18, 200), true, [id=#29]
> ::+- LocalTableScan [i2#18, j2#19]
> :+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, 
> int, false] as bigint))), [id=#34]
> :   +- LocalTableScan [i3#29, j3#30]
> +- *(5) Sort [i4#40 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(i4#40, 200), true, [id=#39]
>   +- LocalTableScan [i4#40, j4#41]
> {code}
>  But, since BroadcastHashJoinExec handles only equi-joins, if the streamed side 
> has HashPartitioning on the join keys, its outputPartitioning can also cover the 
> equivalent build-side keys and eliminate that exchange.






[jira] [Updated] (SPARK-31869) BroadcastHashJoinExec's outputPartitioning can utilize the build side

2020-05-29 Thread Terry Kim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry Kim updated SPARK-31869:
--
Description: 
Currently, the BroadcastHashJoinExec's outputPartitioning only uses the 
streamed side's outputPartitioning. Thus, if a subsequent join's key comes from 
the build side of a BroadcastHashJoinExec:
{code:java}
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "500")
val t1 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i1", "j1")
val t2 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i2", "j2")
val t3 = (0 until 20).map(i => (i % 7, i % 11)).toDF("i3", "j3")
val t4 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i4", "j4")

// join1 is a sort merge join.
val join1 = t1.join(t2, t1("i1") === t2("i2"))

// join2 is a broadcast join where t3 is broadcasted.
val join2 = join1.join(t3, join1("i1") === t3("i3"))

// Join on the column from the broadcasted side (i3).
val join3 = join2.join(t4, join2("i3") === t4("i4"))

join3.explain
{code}
it produces Exchange hashpartitioning(i3#29, 200):
{code:java}
== Physical Plan ==
*(6) SortMergeJoin [i3#29], [i4#40], Inner
:- *(4) Sort [i3#29 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(i3#29, 200), true, [id=#55]
: +- *(3) BroadcastHashJoin [i1#7], [i3#29], Inner, BuildRight
::- *(3) SortMergeJoin [i1#7], [i2#18], Inner
::  :- *(1) Sort [i1#7 ASC NULLS FIRST], false, 0
::  :  +- Exchange hashpartitioning(i1#7, 200), true, [id=#28]
::  : +- LocalTableScan [i1#7, j1#8]
::  +- *(2) Sort [i2#18 ASC NULLS FIRST], false, 0
:: +- Exchange hashpartitioning(i2#18, 200), true, [id=#29]
::+- LocalTableScan [i2#18, j2#19]
:+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, 
int, false] as bigint))), [id=#34]
:   +- LocalTableScan [i3#29, j3#30]
+- *(5) Sort [i4#40 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(i4#40, 200), true, [id=#39]
  +- LocalTableScan [i4#40, j4#41]
{code}
 But, since BroadcastHashJoinExec handles only equi-joins, if the streamed side 
has HashPartitioning on the join keys, its outputPartitioning can also cover the 
equivalent build-side keys and eliminate that exchange.

  was:
Currently, the BroadcastHashJoinExec's outputPartitioning only uses the 
streamed side's outputPartitioning. Thus, if a subsequent join's key comes from 
the build side of a BroadcastHashJoinExec:
{code:java}
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "500")
val df1 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i1", "j1")
val df2 = (0 until 20).map(i => (i % 7, i % 11)).toDF("i2", "j2")
val df3 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i3", "j3")
df1.write.format("parquet").bucketBy(8, "i1").saveAsTable("t1")
df3.write.format("parquet").bucketBy(8, "i3").saveAsTable("t3")
val t1 = spark.table("t1")
val t3 = spark.table("t3")
val join1 = t1.join(df2, t1("i1") === df2("i2"))
val join2 = join1.join(t3, join1("i2") === t3("i3"))
join2.explain
{code}
it produces Exchange hashpartitioning(i2#103, 200):
{code:java}
== Physical Plan ==
*(5) SortMergeJoin [i2#103], [i3#124], Inner
:- *(2) Sort [i2#103 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(i2#103, 200)
: +- *(1) BroadcastHashJoin [i1#120], [i2#103], Inner, BuildRight
::- *(1) Project [i1#120, j1#121]
::  +- *(1) Filter isnotnull(i1#120)
:: +- *(1) FileScan parquet default.t1[i1#120,j1#121] Batched: 
true, Format: Parquet, Location: InMemoryFileIndex[], PartitionFilters: [], 
PushedFilters: [IsNotNull(i1)], ReadSchema: struct, 
SelectedBucketsCount: 8 out of 8
:+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, 
int, false] as bigint)))
:   +- LocalTableScan [i2#103, j2#104]
+- *(4) Sort [i3#124 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(i3#124, 200)
  +- *(3) Project [i3#124, j3#125]
 +- *(3) Filter isnotnull(i3#124)
+- *(3) FileScan parquet default.t3[i3#124,j3#125] Batched: true, 
Format: Parquet, Location: InMemoryFileIndex[], PartitionFilters: [], 
PushedFilters: [IsNotNull(i3)], ReadSchema: struct, 
SelectedBucketsCount: 8 out of 8
{code}
 But, since BroadcastHashJoinExec handles only equi-joins, if the streamed side 
has HashPartitioning on the join keys, its outputPartitioning can also cover the 
equivalent build-side keys and eliminate that exchange.


> BroadcastHashJoinExec's outputPartitioning can utilize the build side
> 
>
> Key: SPARK-31869
> URL: https://issues.apache.org/jira/browse/SPARK-31869
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Minor
>
> Currently, the BroadcastHashJoinExec's outputPartitioning only uses the 
> streamed side's outputPartitioning. Thus, if the join key 

[jira] [Created] (SPARK-31869) BroadcastHashJoinExec's outputPartitioning can utilize the build side

2020-05-29 Thread Terry Kim (Jira)
Terry Kim created SPARK-31869:
-

 Summary: BroadcastHashJoinExec's outputPartitioning can utilize the 
build side
 Key: SPARK-31869
 URL: https://issues.apache.org/jira/browse/SPARK-31869
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Terry Kim


Currently, the BroadcastHashJoinExec's outputPartitioning only uses the 
streamed side's outputPartitioning. Thus, if a subsequent join's key comes from 
the build side of a BroadcastHashJoinExec:
{code:java}
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "500")
val df1 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i1", "j1")
val df2 = (0 until 20).map(i => (i % 7, i % 11)).toDF("i2", "j2")
val df3 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i3", "j3")
df1.write.format("parquet").bucketBy(8, "i1").saveAsTable("t1")
df3.write.format("parquet").bucketBy(8, "i3").saveAsTable("t3")
val t1 = spark.table("t1")
val t3 = spark.table("t3")
val join1 = t1.join(df2, t1("i1") === df2("i2"))
val join2 = join1.join(t3, join1("i2") === t3("i3"))
join2.explain
{code}
it produces Exchange hashpartitioning(i2#103, 200):
{code:java}
== Physical Plan ==
*(5) SortMergeJoin [i2#103], [i3#124], Inner
:- *(2) Sort [i2#103 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(i2#103, 200)
: +- *(1) BroadcastHashJoin [i1#120], [i2#103], Inner, BuildRight
::- *(1) Project [i1#120, j1#121]
::  +- *(1) Filter isnotnull(i1#120)
:: +- *(1) FileScan parquet default.t1[i1#120,j1#121] Batched: 
true, Format: Parquet, Location: InMemoryFileIndex[], PartitionFilters: [], 
PushedFilters: [IsNotNull(i1)], ReadSchema: struct, 
SelectedBucketsCount: 8 out of 8
:+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, 
int, false] as bigint)))
:   +- LocalTableScan [i2#103, j2#104]
+- *(4) Sort [i3#124 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(i3#124, 200)
  +- *(3) Project [i3#124, j3#125]
 +- *(3) Filter isnotnull(i3#124)
+- *(3) FileScan parquet default.t3[i3#124,j3#125] Batched: true, 
Format: Parquet, Location: InMemoryFileIndex[], PartitionFilters: [], 
PushedFilters: [IsNotNull(i3)], ReadSchema: struct, 
SelectedBucketsCount: 8 out of 8
{code}
 But, since BroadcastHashJoinExec handles only equi-joins, if the streamed side 
has HashPartitioning on the join keys, its outputPartitioning can also cover the 
equivalent build-side keys and eliminate that exchange.






[jira] [Comment Edited] (SPARK-31779) Redefining struct inside array incorrectly wraps child fields in array

2020-05-29 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119931#comment-17119931
 ] 

Jeff Evans edited comment on SPARK-31779 at 5/29/20, 9:01 PM:
--

Thanks, using {{arrays_zip}} (along with an extra {{cast}} to a new {{struct}} 
to preserve the original field names, since {{arrays_zip}} seems to set them to 
"0", "1", etc.) seems to work.

I still find the behavior quite counterintuitive, but this workaround will 
allow me to solve the immediate requirement. 


was (Author: jeff.w.evans):
Thanks, using {{arrays_zip}} (along with an extra {{cast}} to preserve the 
existing field names, since {{arrays_zip}} seems to set them to "0", "1", etc.) 
seems to work.

I still find the behavior quite counterintuitive, but this workaround will 
allow me to solve the immediate requirement. 

> Redefining struct inside array incorrectly wraps child fields in array
> --
>
> Key: SPARK-31779
> URL: https://issues.apache.org/jira/browse/SPARK-31779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Jeff Evans
>Priority: Major
>
> It seems that redefining a {{struct}} for the purpose of removing a 
> sub-field, when that {{struct}} is itself inside an {{array}}, results in the 
> remaining (non-removed) {{struct}} fields themselves being incorrectly 
> wrapped in an array.
> For more context, see [this|https://stackoverflow.com/a/46084983/375670] 
> StackOverflow answer and discussion thread.  I have debugged this code and 
> distilled it down to what I believe represents a bug in Spark itself.
> Consider the following {{spark-shell}} session (version 2.4.5):
> {code}
> // use a nested JSON structure that contains a struct inside an array
> val jsonData = """{
>   "foo": "bar",
>   "top": {
> "child1": 5,
> "child2": [
>   {
> "child2First": "one",
> "child2Second": 2
>   }
> ]
>   }
> }"""
> // read into a DataFrame
> val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS())
> // create a new definition for "top", which will remove the 
> "top.child2.child2First" column
> val newTop = struct(df("top").getField("child1").alias("child1"), 
> array(struct(df("top").getField("child2").getField("child2Second").alias("child2Second"))).alias("child2"))
> // show the schema before and after swapping out the struct definition
> df.schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: 
> ARRAY<STRUCT<`child2First`: STRING, `child2Second`: BIGINT>>>
> df.withColumn("top", newTop).schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: 
> ARRAY<STRUCT<`child2Second`: ARRAY<BIGINT>>>>
> {code}
> Notice in this case that the new definition for {{top.child2.child2Second}} 
> is an {{ARRAY}}.  This is incorrect; it should simply be {{BIGINT}}.  
> There is nothing in the definition of the {{newTop}} {{struct}} that should 
> have caused the type to become wrapped in an array like this.






[jira] [Resolved] (SPARK-31541) Backport SPARK-26095 Disable parallelization in make-distribution.sh. (Avoid build hanging)

2020-05-29 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-31541.
--
Fix Version/s: 2.4.6
 Assignee: Marcelo Masiero Vanzin
   Resolution: Fixed

> Backport SPARK-26095 Disable parallelization in make-distribution.sh. 
> (Avoid build hanging)
> 
>
> Key: SPARK-31541
> URL: https://issues.apache.org/jira/browse/SPARK-31541
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Assignee: Marcelo Masiero Vanzin
>Priority: Major
> Fix For: 2.4.6
>
>
> Backport SPARK-26095 Disable parallelization in make-distribution.sh. 
> (Avoid build hanging)






[jira] [Commented] (SPARK-31779) Redefining struct inside array incorrectly wraps child fields in array

2020-05-29 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119931#comment-17119931
 ] 

Jeff Evans commented on SPARK-31779:


Thanks, using {{arrays_zip}} (along with an extra {{cast}} to preserve the 
existing field names, since {{arrays_zip}} seems to set them to "0", "1", etc.) 
seems to work.

I still find the behavior quite counterintuitive, but this workaround will 
allow me to solve the immediate requirement. 
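A minimal sketch of that workaround, assuming the {{df}} built in the reproduction quoted below (the exact cast type string and the expected schema comment are illustrative, not output from a run):

{code:scala}
// Sketch of the arrays_zip + cast workaround, assuming the `df` from the repro below.
// arrays_zip names the zipped struct field "0", so the cast restores the field name.
import org.apache.spark.sql.functions.{arrays_zip, struct}

val keepOnlySecond = arrays_zip(df("top.child2.child2Second"))
  .cast("array<struct<child2Second:bigint>>")

val newTop = struct(
  df("top.child1").alias("child1"),
  keepOnlySecond.alias("child2"))

df.withColumn("top", newTop).schema.toDDL
// expected: ... `child2`: ARRAY<STRUCT<`child2Second`: BIGINT>> (no extra array wrapping)
{code}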

> Redefining struct inside array incorrectly wraps child fields in array
> --
>
> Key: SPARK-31779
> URL: https://issues.apache.org/jira/browse/SPARK-31779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Jeff Evans
>Priority: Major
>
> It seems that redefining a {{struct}} for the purpose of removing a 
> sub-field, when that {{struct}} is itself inside an {{array}}, results in the 
> remaining (non-removed) {{struct}} fields themselves being incorrectly 
> wrapped in an array.
> For more context, see [this|https://stackoverflow.com/a/46084983/375670] 
> StackOverflow answer and discussion thread.  I have debugged this code and 
> distilled it down to what I believe represents a bug in Spark itself.
> Consider the following {{spark-shell}} session (version 2.4.5):
> {code}
> // use a nested JSON structure that contains a struct inside an array
> val jsonData = """{
>   "foo": "bar",
>   "top": {
> "child1": 5,
> "child2": [
>   {
> "child2First": "one",
> "child2Second": 2
>   }
> ]
>   }
> }"""
> // read into a DataFrame
> val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS())
> // create a new definition for "top", which will remove the 
> "top.child2.child2First" column
> val newTop = struct(df("top").getField("child1").alias("child1"), 
> array(struct(df("top").getField("child2").getField("child2Second").alias("child2Second"))).alias("child2"))
> // show the schema before and after swapping out the struct definition
> df.schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: 
> ARRAY<STRUCT<`child2First`: STRING, `child2Second`: BIGINT>>>
> df.withColumn("top", newTop).schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: 
> ARRAY<STRUCT<`child2Second`: ARRAY<BIGINT>>>>
> {code}
> Notice in this case that the new definition for {{top.child2.child2Second}} 
> is an {{ARRAY}}.  This is incorrect; it should simply be {{BIGINT}}.  
> There is nothing in the definition of the {{newTop}} {{struct}} that should 
> have caused the type to become wrapped in an array like this.






[jira] [Updated] (SPARK-26095) make-distribution.sh is hanging in jenkins

2020-05-29 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-26095:
-
Fix Version/s: 2.4.6
Affects Version/s: 2.4.6

> make-distribution.sh is hanging in jenkins
> --
>
> Key: SPARK-26095
> URL: https://issues.apache.org/jira/browse/SPARK-26095
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Marcelo Masiero Vanzin
>Assignee: Marcelo Masiero Vanzin
>Priority: Critical
> Fix For: 2.4.6, 3.0.0
>
>
> See https://github.com/apache/spark/pull/23017 for further discussion.
> maven seems to get stuck here:
> {noformat}
> "BuilderThread 5" #80 prio=5 os_prio=0 tid=0x7f16b850 nid=0x7bcf 
> runnable [0x7f16882fd000]
>java.lang.Thread.State: RUNNABLE
> at org.jdom2.Element.isAncestor(Element.java:1052)
> at org.jdom2.ContentList.checkPreConditions(ContentList.java:222)
> at org.jdom2.ContentList.add(ContentList.java:244)
> at org.jdom2.Element.addContent(Element.java:950)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.insertAtPreferredLocation(MavenJDOMWriter.java:292)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.iterateExclusion(MavenJDOMWriter.java:488)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.updateDependency(MavenJDOMWriter.java:1335)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.iterateDependency(MavenJDOMWriter.java:386)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.updateModel(MavenJDOMWriter.java:1623)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.write(MavenJDOMWriter.java:2156)
> at 
> org.apache.maven.plugins.shade.pom.PomWriter.write(PomWriter.java:75)
> at 
> org.apache.maven.plugins.shade.mojo.ShadeMojo.rewriteDependencyReducedPomIfWeHaveReduction(ShadeMojo.java:1049)
> at 
> org.apache.maven.plugins.shade.mojo.ShadeMojo.createDependencyReducedPom(ShadeMojo.java:978)
> at 
> org.apache.maven.plugins.shade.mojo.ShadeMojo.execute(ShadeMojo.java:538)
> at 
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
> at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
> at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:154)
> at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:146)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
> at 
> org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:200)
> at 
> org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:196)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> And in fact I see a bunch of threads stuck there. Trying a few different 
> things.






[jira] [Reopened] (SPARK-31541) Backport SPARK-26095 Disable parallelization in make-distribution.sh. (Avoid build hanging)

2020-05-29 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau reopened SPARK-31541:
--

> Backport SPARK-26095 Disable parallelization in make-distribution.sh. 
> (Avoid build hanging)
> 
>
> Key: SPARK-31541
> URL: https://issues.apache.org/jira/browse/SPARK-31541
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Backport SPARK-26095 Disable parallelization in make-distribution.sh. 
> (Avoid build hanging)






[jira] [Commented] (SPARK-31541) Backport SPARK-26095 Disable parallelization in make-distribution.sh. (Avoid build hanging)

2020-05-29 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119930#comment-17119930
 ] 

Holden Karau commented on SPARK-31541:
--

Ran into issues with this during the RC7 build, so I'm backporting it.

> Backport SPARK-26095 Disable parallelization in make-distribution.sh. 
> (Avoid build hanging)
> 
>
> Key: SPARK-31541
> URL: https://issues.apache.org/jira/browse/SPARK-31541
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Backport SPARK-26095 Disable parallelization in make-distribution.sh. 
> (Avoid build hanging)






[jira] [Commented] (SPARK-31756) Add real headless browser support for UI test

2020-05-29 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119792#comment-17119792
 ] 

Dongjoon Hyun commented on SPARK-31756:
---

This is resolved via https://github.com/apache/spark/pull/28627 .

> Add real headless browser support for UI test
> -
>
> Key: SPARK-31756
> URL: https://issues.apache.org/jira/browse/SPARK-31756
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests, Web UI
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.1.0
>
>
> In the current master, there are two problems with UI testing.
> 1. Lots of tests, especially JavaScript-related ones, are done manually.
> Appearance is best confirmed by eye, but logic should ideally be covered by 
> test cases.
>  
> 2. Compared to real web browsers, HtmlUnit doesn't seem to support 
> JavaScript well enough.
> I previously added a JavaScript-related test for SPARK-31534 using HtmlUnit, which 
> is a simple library-based headless browser for testing.
> The test I added mostly works, but a JavaScript-related error shows up in 
> unit-tests.log.
> {code:java}
> === EXCEPTION START 
> Exception 
> class=[net.sourceforge.htmlunit.corejs.javascript.JavaScriptException]
> com.gargoylesoftware.htmlunit.ScriptException: Error: TOOLTIP: Option 
> "sanitizeFn" provided type "window" but expected type "(null|function)". 
> (http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2)
>         at 
> com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:904)
>         at 
> net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628)
>         at 
> net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:515)
>         at 
> com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:835)
>         at 
> com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:807)
>         at 
> com.gargoylesoftware.htmlunit.InteractivePage.executeJavaScriptFunctionIfPossible(InteractivePage.java:216)
>         at 
> com.gargoylesoftware.htmlunit.javascript.background.JavaScriptFunctionJob.runJavaScript(JavaScriptFunctionJob.java:52)
>         at 
> com.gargoylesoftware.htmlunit.javascript.background.JavaScriptExecutionJob.run(JavaScriptExecutionJob.java:102)
>         at 
> com.gargoylesoftware.htmlunit.javascript.background.JavaScriptJobManagerImpl.runSingleJob(JavaScriptJobManagerImpl.java:426)
>         at 
> com.gargoylesoftware.htmlunit.javascript.background.DefaultJavaScriptExecutor.run(DefaultJavaScriptExecutor.java:157)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: net.sourceforge.htmlunit.corejs.javascript.JavaScriptException: 
> Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type 
> "(null|function)". (http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2)
>         at 
> net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1009)
>         at 
> net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:800)
>         at 
> net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105)
>         at 
> net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:413)
>         at 
> com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:252)
>         at 
> net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3264)
>         at 
> com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$4.doRun(JavaScriptEngine.java:828)
>         at 
> com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:889)
>         ... 10 more
> JavaScriptException value = Error: TOOLTIP: Option "sanitizeFn" provided type 
> "window" but expected type "(null|function)".
> == CALLING JAVASCRIPT ==
>   function () {
>       throw e;
>   }
> === EXCEPTION END {code}
>  
> I tried to upgrade HtmlUnit to 2.40.0, but to make matters worse, the test stopped 
> working even though it runs without error on real browsers like Chrome, Safari, 
> and Firefox.
> {code:java}
> [info] UISeleniumSuite:
> [info] - SPARK-31534: text for tooltip should be escaped *** FAILED *** (17 
> seconds, 745 milliseconds)
> [info]   The code passed to eventually never returned normally. Attempted 2 
> times over 12.910785232 seconds. Last failure message: 
> com.gargoylesoftware.htmlunit.ScriptException: ReferenceError: Assignment to 
> undefined "regeneratorRuntime" in strict mode 
> 

[jira] [Assigned] (SPARK-31868) Wrong results for week-based-year

2020-05-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31868:


Assignee: (was: Apache Spark)

> Wrong results for week-based-year 
> --
>
> Key: SPARK-31868
> URL: https://issues.apache.org/jira/browse/SPARK-31868
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Blocker
>  Labels: correctness
>
> {code:sql}
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd')
>  > ;
> == Physical Plan ==
> *(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#37]
> +- *(1) Scan OneRowRelation[]
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
> == Physical Plan ==
> *(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#53]
> +- *(1) Scan OneRowRelation[]
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
> == Physical Plan ==
> *(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#69]
> +- *(1) Scan OneRowRelation[]
> spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
> 1970-01-01 00:00:00
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
> == Physical Plan ==
> *(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#87]
> +- *(1) Scan OneRowRelation[]
> spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
> 1968-12-29 00:00:00
> spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
> 1969-01-01 00:00:00
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
> 1969-01-01 00:00:00
> {code}
> {code:java}
> == Physical Plan ==
> *(1) Project [-288 AS to_timestamp(2000-01-01, -MM-dd)#73]
> +- *(1) Scan OneRowRelation[]
> spark-sql> select to_timestamp('2000-01-01', '-MM-dd');
> 1970-01-01 00:00:00
> {code}
> It seems that `-288` is static: the same folded value appears regardless of the input date.
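(The pattern letters were mangled by the mail archive in the quoted shell output above, but the distinction at play is between the week-based year, pattern letter {{Y}}, and the calendar year, pattern letter {{y}}. A standalone {{java.time}} illustration, independent of Spark:)

{code:scala}
// Standalone java.time illustration (no Spark): 1969-12-29 is a Monday that already
// falls into ISO week 1 of 1970, so its week-based year ('Y') differs from its
// calendar year ('y'). Mixing 'Y' with month/day fields therefore shifts dates.
import java.time.LocalDate
import java.time.temporal.WeekFields

val d = LocalDate.of(1969, 12, 29)
println(d.getYear)                           // 1969 -- calendar year, pattern letter 'y'
println(d.get(WeekFields.ISO.weekBasedYear)) // 1970 -- week-based year, pattern letter 'Y'
{code}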






[jira] [Commented] (SPARK-31868) Wrong results for week-based-year

2020-05-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119776#comment-17119776
 ] 

Apache Spark commented on SPARK-31868:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/28674

> Wrong results for week-based-year 
> --
>
> Key: SPARK-31868
> URL: https://issues.apache.org/jira/browse/SPARK-31868
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Blocker
>  Labels: correctness
>
> {code:sql}
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd')
>  > ;
> == Physical Plan ==
> *(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#37]
> +- *(1) Scan OneRowRelation[]
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
> == Physical Plan ==
> *(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#53]
> +- *(1) Scan OneRowRelation[]
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
> == Physical Plan ==
> *(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#69]
> +- *(1) Scan OneRowRelation[]
> spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
> 1970-01-01 00:00:00
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
> == Physical Plan ==
> *(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#87]
> +- *(1) Scan OneRowRelation[]
> spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
> 1968-12-29 00:00:00
> spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
> 1969-01-01 00:00:00
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
> 1969-01-01 00:00:00
> {code}
> {code:java}
> == Physical Plan ==
> *(1) Project [-288 AS to_timestamp(2000-01-01, -MM-dd)#73]
> +- *(1) Scan OneRowRelation[]
> spark-sql> select to_timestamp('2000-01-01', '-MM-dd');
> 1970-01-01 00:00:00
> {code}
> It seems that `-288` is static: the same folded value appears regardless of the input date.






[jira] [Assigned] (SPARK-31868) Wrong results for week-based-year

2020-05-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31868:


Assignee: Apache Spark

> Wrong results for week-based-year 
> --
>
> Key: SPARK-31868
> URL: https://issues.apache.org/jira/browse/SPARK-31868
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Blocker
>  Labels: correctness
>
> {code:sql}
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd')
>  > ;
> == Physical Plan ==
> *(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#37]
> +- *(1) Scan OneRowRelation[]
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
> == Physical Plan ==
> *(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#53]
> +- *(1) Scan OneRowRelation[]
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
> == Physical Plan ==
> *(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#69]
> +- *(1) Scan OneRowRelation[]
> spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
> 1970-01-01 00:00:00
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
> == Physical Plan ==
> *(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#87]
> +- *(1) Scan OneRowRelation[]
> spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
> 1968-12-29 00:00:00
> spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
> 1969-01-01 00:00:00
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
> 1969-01-01 00:00:00
> {code}
> {code:java}
> == Physical Plan ==
> *(1) Project [-288 AS to_timestamp(2000-01-01, -MM-dd)#73]
> +- *(1) Scan OneRowRelation[]
> spark-sql> select to_timestamp('2000-01-01', '-MM-dd');
> 1970-01-01 00:00:00
> {code}
> It seems that `-288` is static: the same folded value appears regardless of the input date.






[jira] [Commented] (SPARK-31868) Wrong results for week-based-year

2020-05-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119777#comment-17119777
 ] 

Apache Spark commented on SPARK-31868:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/28674

> Wrong results for week-based-year 
> --
>
> Key: SPARK-31868
> URL: https://issues.apache.org/jira/browse/SPARK-31868
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Blocker
>  Labels: correctness
>
> {code:sql}
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd')
>  > ;
> == Physical Plan ==
> *(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#37]
> +- *(1) Scan OneRowRelation[]
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
> == Physical Plan ==
> *(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#53]
> +- *(1) Scan OneRowRelation[]
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
> == Physical Plan ==
> *(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#69]
> +- *(1) Scan OneRowRelation[]
> spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
> 1970-01-01 00:00:00
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
> == Physical Plan ==
> *(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#87]
> +- *(1) Scan OneRowRelation[]
> spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
> 1968-12-29 00:00:00
> spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
> 1969-01-01 00:00:00
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
> 1969-01-01 00:00:00
> {code}
> {code:java}
> == Physical Plan ==
> *(1) Project [-288 AS to_timestamp(2000-01-01, -MM-dd)#73]
> +- *(1) Scan OneRowRelation[]
> spark-sql> select to_timestamp('2000-01-01', '-MM-dd');
> 1970-01-01 00:00:00
> {code}
> It seems that `-288` is static: the same folded value appears regardless of the input date.






[jira] [Resolved] (SPARK-31858) Upgrade commons-io to 2.5 in Hadoop 3.2 profile

2020-05-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31858.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28665
[https://github.com/apache/spark/pull/28665]

> Upgrade commons-io to 2.5 in Hadoop 3.2 profile
> ---
>
> Key: SPARK-31858
> URL: https://issues.apache.org/jira/browse/SPARK-31858
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.1.0
>
>







[jira] [Resolved] (SPARK-31214) Upgrade Janino to 3.1.2

2020-05-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31214.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27860
[https://github.com/apache/spark/pull/27860]

> Upgrade Janino to 3.1.2
> ---
>
> Key: SPARK-31214
> URL: https://issues.apache.org/jira/browse/SPARK-31214
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>







[jira] [Assigned] (SPARK-31214) Upgrade Janino to 3.1.2

2020-05-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31214:
-

Assignee: Dongjoon Hyun

> Upgrade Janino to 3.1.2
> ---
>
> Key: SPARK-31214
> URL: https://issues.apache.org/jira/browse/SPARK-31214
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>







[jira] [Assigned] (SPARK-31214) Upgrade Janino to 3.1.2

2020-05-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31214:
-

Assignee: Jungtaek Lim  (was: Dongjoon Hyun)

> Upgrade Janino to 3.1.2
> ---
>
> Key: SPARK-31214
> URL: https://issues.apache.org/jira/browse/SPARK-31214
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.1.0
>
>







[jira] [Comment Edited] (SPARK-31815) Support Hive Kerberos login in JDBC connector

2020-05-29 Thread Agrim Bansal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119607#comment-17119607
 ] 

Agrim Bansal edited comment on SPARK-31815 at 5/29/20, 1:35 PM:


The best thing possible would be if we could pass the JDBC URL as 

"jdbc:hive2://zk1:port1,zk2:port2/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;principal=hive/_h...@commercial.abc.com"


was (Author: agrim):
The best thing possible would be if we could pass the JDBC URL as 

"jdbc:hive2://anah_zk1:2181,anah_zk2:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;principal=hive/_h...@commercial.abc.com"

> Support Hive Kerberos login in JDBC connector
> -
>
> Key: SPARK-31815
> URL: https://issues.apache.org/jira/browse/SPARK-31815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>







[jira] [Commented] (SPARK-31817) Pass-through of Kerberos credentials from Spark SQL to a jdbc source

2020-05-29 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119609#comment-17119609
 ] 

Gabor Somogyi commented on SPARK-31817:
---

Thanks [~hyukjin.kwon] for pinging me! Interesting topic, and I'm happy to take a look 
at the approach. I'm about to take a week off, though, so this probably won't be 
super fast :)
From my perspective it would be good to have some sort of document which 
describes the use case(s) we would like to solve. It would be easier to 
speak the same language. Once we have that, we'll hopefully see clearly 
whether it's possible or not.

As an initial thought: a JDBC source needs keytab files deployed on all nodes 
to do authentication (databases don't support delegation tokens), and I 
personally don't see any other option at the moment.
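A hedged sketch of that pattern (the principal, keytab path, and JDBC URL below are placeholders, and the Hive JDBC driver is assumed to be on the classpath): each task logs in from a keytab that has been shipped to every node, then opens its connection inside a {{doAs}} block.

{code:scala}
// Sketch only: keytab path, principal, and JDBC URL are placeholders.
import java.security.PrivilegedExceptionAction
import java.sql.{Connection, DriverManager}
import org.apache.hadoop.security.UserGroupInformation

val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
  "hive/host@EXAMPLE.COM",                 // placeholder principal
  "/etc/security/keytabs/hive.keytab")     // keytab deployed on every node

val conn = ugi.doAs(new PrivilegedExceptionAction[Connection] {
  override def run(): Connection =
    DriverManager.getConnection(
      "jdbc:hive2://host:10000/default;principal=hive/host@EXAMPLE.COM")
})
{code}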


> Pass-through of Kerberos credentials from Spark SQL to a jdbc source
> 
>
> Key: SPARK-31817
> URL: https://issues.apache.org/jira/browse/SPARK-31817
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Luis Lozano Coira
>Priority: Major
>
> I am connecting to Spark SQL through the Thrift JDBC/ODBC server using 
> kerberos. From Spark SQL I have connected to a JDBC source using basic 
> authentication but I am interested in doing a pass-through of kerberos 
> credentials to this JDBC source. 
> Would it be possible to do something like that? If not possible, could you 
> consider adding this functionality?
> Anyway I would like to start testing this pass-through and try to develop an 
> approach by myself. How could this functionality be added? Could you give me 
> any indication to start this development?






[jira] [Commented] (SPARK-31815) Support Hive Kerberos login in JDBC connector

2020-05-29 Thread Agrim Bansal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119607#comment-17119607
 ] 

Agrim Bansal commented on SPARK-31815:
--

The best thing possible would be if we could pass the JDBC URL as 

"jdbc:hive2://anah_zk1:2181,anah_zk2:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;principal=hive/_h...@commercial.abc.com"

> Support Hive Kerberos login in JDBC connector
> -
>
> Key: SPARK-31815
> URL: https://issues.apache.org/jira/browse/SPARK-31815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>







[jira] [Comment Edited] (SPARK-31815) Support Hive Kerberos login in JDBC connector

2020-05-29 Thread Agrim Bansal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119597#comment-17119597
 ] 

Agrim Bansal edited comment on SPARK-31815 at 5/29/20, 1:24 PM:


Hi [~gsomogyi], I am trying to read two dataframes from two different Hive 
data sources (different clusters) over JDBC.

Just as we can pass a username and password for other data sources, is there any 
way to authenticate against a different cluster with Kerberos in the case of Hive?

 


was (Author: agrim):
Hi [~gsomogyi], I am trying to read two dataframes from two different Hive 
data sources (different clusters) over JDBC.

Just as we can pass a username and password for other data sources, is there any 
way to authenticate against a different cluster with Kerberos in the case of Hive?

 

> Support Hive Kerberos login in JDBC connector
> -
>
> Key: SPARK-31815
> URL: https://issues.apache.org/jira/browse/SPARK-31815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>







[jira] [Commented] (SPARK-31815) Support Hive Kerberos login in JDBC connector

2020-05-29 Thread Agrim Bansal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119597#comment-17119597
 ] 

Agrim Bansal commented on SPARK-31815:
--

Hi [~gsomogyi], I am trying to read two dataframes from two different Hive 
data sources (different clusters) over JDBC.

Just as we can pass a username and password for other data sources, is there any 
way to authenticate against a different cluster with Kerberos in the case of Hive?

 

> Support Hive Kerberos login in JDBC connector
> -
>
> Key: SPARK-31815
> URL: https://issues.apache.org/jira/browse/SPARK-31815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>







[jira] [Commented] (SPARK-31813) Cannot write snappy-compressed text files

2020-05-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119584#comment-17119584
 ] 

Hyukjin Kwon commented on SPARK-31813:
--

Well, ORC and Parquet have their own snappy logic IIRC. They might have a fallback. 
At least I know they use different implementations.
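For context, the stack trace in the description goes through Hadoop's native {{SnappyCodec}}, while Parquet/ORC ship their own pure-Java Snappy implementations. A rough way to check both paths from a Scala shell (a sketch, assuming the snappy jars listed in the description plus hadoop-common from Spark 2.4's bundled Hadoop 2.x are on the classpath):

{code:scala}
// Rough check of the two code paths (assumes snappy-java and Hadoop 2.x hadoop-common
// on the classpath, as bundled with Spark 2.4).
import org.apache.hadoop.util.NativeCodeLoader

// Pure-Java snappy-java, used by the Parquet writer: works out of the box.
println(org.xerial.snappy.Snappy.compress("hello".getBytes("UTF-8")).length)

// Hadoop's native snappy bindings, required by the text (CSV/JSON) compression path:
// false on a plain `pip install pyspark`, which is why the CSV/JSON snappy write fails.
println(NativeCodeLoader.isNativeCodeLoaded && NativeCodeLoader.buildSupportsSnappy())
{code}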

> Cannot write snappy-compressed text files
> -
>
> Key: SPARK-31813
> URL: https://issues.apache.org/jira/browse/SPARK-31813
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.5
>Reporter: Ondrej Kokes
>Priority: Minor
>
> After installing pyspark (pip install pyspark) on both macOS and Ubuntu (a 
> clean Docker image with default-jre), Spark fails to write text-based files 
> (CSV and JSON) with snappy compression. It can snappy-compress parquet and 
> orc, and gzipping CSVs also works.
> This is a clean PySpark installation; the snappy jars are in place:
> {{$ ls -1 /usr/local/lib/python3.7/site-packages/pyspark/jars/ | grep snappy}}
> {{snappy-0.2.jar}}
> {{snappy-java-1.1.7.3.jar}}
> Repro 1 (Scala):
> $ spark-shell
> {{spark.sql("select 1").write.option("compression", 
> "snappy").mode("overwrite").parquet("tmp/foo")}}
> spark.sql("select 1").write.option("compression", 
> "snappy").mode("overwrite").csv("tmp/foo")
> The first (parquet) will work, the second one won't.
> Repro 2 (PySpark):
>  {{from pyspark.sql import SparkSession}}
>  {{if __name__ == '__main__':}}
>  {{    spark = SparkSession.builder.appName('snappy_testing').getOrCreate()}}
>  {{  spark.sql('select 1').write.option('compression', 
> 'snappy').mode('overwrite').parquet('tmp/works_fine')}}
>  {{  spark.sql('select 1').write.option('compression', 
> 'gzip').mode('overwrite').csv('tmp/also_works')}}
>  {{  spark.sql('select 1').write.option('compression', 
> 'snappy').mode('overwrite').csv('tmp/snappy_not_found')}}
>   
>  In either case I get the following traceback
> java.lang.RuntimeException: native snappy library not available: this version 
> of libhadoop was built without snappy support.java.lang.RuntimeException: 
> native snappy library not available: this version of libhadoop was built 
> without snappy support. at 
> org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:65)
>  at 
> org.apache.hadoop.io.compress.SnappyCodec.getCompressorType(SnappyCodec.java:134)
>  at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:150) 
> at 
> org.apache.hadoop.io.compress.CompressionCodec$Util.createOutputStreamWithCodecPool(CompressionCodec.java:131)
>  at 
> org.apache.hadoop.io.compress.SnappyCodec.createOutputStream(SnappyCodec.java:100)
>  at 
> org.apache.spark.sql.execution.datasources.CodecStreams$$anonfun$createOutputStream$1.apply(CodecStreams.scala:84)
>  at 
> org.apache.spark.sql.execution.datasources.CodecStreams$$anonfun$createOutputStream$1.apply(CodecStreams.scala:84)
>  at scala.Option.map(Option.scala:146) at 
> org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:84)
>  at 
> org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.(CSVFileFormat.scala:177)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:85)
>  at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
>  at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:108)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:236)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.run(Task.scala:123) at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-31813) Cannot write snappy-compressed text files

2020-05-29 Thread Ondrej Kokes (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119574#comment-17119574
 ] 

Ondrej Kokes commented on SPARK-31813:
--

[~hyukjin.kwon] I'm not disputing the lack of functionality, and I'm fine with 
installing an additional library. I'm reporting an *inconsistency*: I can 
write snappy-compressed Parquet/ORC, but not JSON/CSV. I should be able to do 
either both of these or neither.

> Cannot write snappy-compressed text files
> -
>
> Key: SPARK-31813
> URL: https://issues.apache.org/jira/browse/SPARK-31813
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.5
>Reporter: Ondrej Kokes
>Priority: Minor
>
> After installing pyspark (pip install pyspark) on both macOS and Ubuntu (a 
> clean Docker image with default-jre), Spark fails to write text-based files 
> (CSV and JSON) with snappy compression. It can snappy compress parquet and 
> orc, gzipping CSVs also works.
> This is a clean PySpark installation, snappy jars are in place
> {{$ ls -1 /usr/local/lib/python3.7/site-packages/pyspark/jars/ | grep snappy}}
> {{snappy-0.2.jar}}
> {{snappy-java-1.1.7.3.jar}}
> Repro 1 (Scala):
> $ spark-shell
> {{spark.sql("select 1").write.option("compression", 
> "snappy").mode("overwrite").parquet("tmp/foo")}}
> spark.sql("select 1").write.option("compression", 
> "snappy").mode("overwrite").csv("tmp/foo")
> The first (parquet) will work, the second one won't.
> Repro 2 (PySpark):
>  {{from pyspark.sql import SparkSession}}
>  {{if __name__ == '__main__':}}
>  {{    spark = SparkSession.builder.appName('snappy_testing').getOrCreate()}}
>  {{  spark.sql('select 1').write.option('compression', 
> 'snappy').mode('overwrite').parquet('tmp/works_fine')}}
>  {{  spark.sql('select 1').write.option('compression', 
> 'gzip').mode('overwrite').csv('tmp/also_works')}}
>  {{  spark.sql('select 1').write.option('compression', 
> 'snappy').mode('overwrite').csv('tmp/snappy_not_found')}}
>   
>  In either case I get the following traceback
> java.lang.RuntimeException: native snappy library not available: this version 
> of libhadoop was built without snappy support.java.lang.RuntimeException: 
> native snappy library not available: this version of libhadoop was built 
> without snappy support. at 
> org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:65)
>  at 
> org.apache.hadoop.io.compress.SnappyCodec.getCompressorType(SnappyCodec.java:134)
>  at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:150) 
> at 
> org.apache.hadoop.io.compress.CompressionCodec$Util.createOutputStreamWithCodecPool(CompressionCodec.java:131)
>  at 
> org.apache.hadoop.io.compress.SnappyCodec.createOutputStream(SnappyCodec.java:100)
>  at 
> org.apache.spark.sql.execution.datasources.CodecStreams$$anonfun$createOutputStream$1.apply(CodecStreams.scala:84)
>  at 
> org.apache.spark.sql.execution.datasources.CodecStreams$$anonfun$createOutputStream$1.apply(CodecStreams.scala:84)
>  at scala.Option.map(Option.scala:146) at 
> org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:84)
>  at 
> org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.(CSVFileFormat.scala:177)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:85)
>  at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
>  at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:108)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:236)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.run(Task.scala:123) at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (SPARK-31813) Cannot write snappy-compressed text files

2020-05-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119568#comment-17119568
 ] 

Hyukjin Kwon commented on SPARK-31813:
--

Seems like you need the native libraries, as the error says. You should 
manually install the native Snappy library and let Hadoop know about it. It's 
not a Spark issue.
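
(A quick way to check from a Spark shell whether the native bindings are actually 
visible to Hadoop; a sketch using Hadoop's own NativeCodeLoader utility. 
`hadoop checknative -a` on the command line gives a similar report.)

{code:scala}
import org.apache.hadoop.util.NativeCodeLoader

if (NativeCodeLoader.isNativeCodeLoaded()) {
  // buildSupportsSnappy is a native call, so only ask once libhadoop itself is loaded.
  println("libhadoop loaded, snappy support: " + NativeCodeLoader.buildSupportsSnappy())
} else {
  println("libhadoop not found on java.library.path")
}
{code}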

> Cannot write snappy-compressed text files
> -
>
> Key: SPARK-31813
> URL: https://issues.apache.org/jira/browse/SPARK-31813
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.5
>Reporter: Ondrej Kokes
>Priority: Minor
>
> After installing pyspark (pip install pyspark) on both macOS and Ubuntu (a 
> clean Docker image with default-jre), Spark fails to write text-based files 
> (CSV and JSON) with snappy compression. It can snappy compress parquet and 
> orc, gzipping CSVs also works.
> This is a clean PySpark installation, snappy jars are in place
> {{$ ls -1 /usr/local/lib/python3.7/site-packages/pyspark/jars/ | grep snappy}}
> {{snappy-0.2.jar}}
> {{snappy-java-1.1.7.3.jar}}
> Repro 1 (Scala):
> $ spark-shell
> {{spark.sql("select 1").write.option("compression", 
> "snappy").mode("overwrite").parquet("tmp/foo")}}
> spark.sql("select 1").write.option("compression", 
> "snappy").mode("overwrite").csv("tmp/foo")
> The first (parquet) will work, the second one won't.
> Repro 2 (PySpark):
>  {{from pyspark.sql import SparkSession}}
>  {{if __name__ == '__main__':}}
>  {{    spark = SparkSession.builder.appName('snappy_testing').getOrCreate()}}
>  {{  spark.sql('select 1').write.option('compression', 
> 'snappy').mode('overwrite').parquet('tmp/works_fine')}}
>  {{  spark.sql('select 1').write.option('compression', 
> 'gzip').mode('overwrite').csv('tmp/also_works')}}
>  {{  spark.sql('select 1').write.option('compression', 
> 'snappy').mode('overwrite').csv('tmp/snappy_not_found')}}
>   
>  In either case I get the following traceback
> java.lang.RuntimeException: native snappy library not available: this version 
> of libhadoop was built without snappy support.java.lang.RuntimeException: 
> native snappy library not available: this version of libhadoop was built 
> without snappy support. at 
> org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:65)
>  at 
> org.apache.hadoop.io.compress.SnappyCodec.getCompressorType(SnappyCodec.java:134)
>  at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:150) 
> at 
> org.apache.hadoop.io.compress.CompressionCodec$Util.createOutputStreamWithCodecPool(CompressionCodec.java:131)
>  at 
> org.apache.hadoop.io.compress.SnappyCodec.createOutputStream(SnappyCodec.java:100)
>  at 
> org.apache.spark.sql.execution.datasources.CodecStreams$$anonfun$createOutputStream$1.apply(CodecStreams.scala:84)
>  at 
> org.apache.spark.sql.execution.datasources.CodecStreams$$anonfun$createOutputStream$1.apply(CodecStreams.scala:84)
>  at scala.Option.map(Option.scala:146) at 
> org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:84)
>  at 
> org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.(CSVFileFormat.scala:177)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:85)
>  at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
>  at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:108)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:236)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.run(Task.scala:123) at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-31813) Cannot write snappy-compressed text files

2020-05-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31813.
--
Resolution: Invalid

> Cannot write snappy-compressed text files
> -
>
> Key: SPARK-31813
> URL: https://issues.apache.org/jira/browse/SPARK-31813
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.5
>Reporter: Ondrej Kokes
>Priority: Minor
>
> After installing pyspark (pip install pyspark) on both macOS and Ubuntu (a 
> clean Docker image with default-jre), Spark fails to write text-based files 
> (CSV and JSON) with snappy compression. It can snappy compress parquet and 
> orc, gzipping CSVs also works.
> This is a clean PySpark installation, snappy jars are in place
> {{$ ls -1 /usr/local/lib/python3.7/site-packages/pyspark/jars/ | grep snappy}}
> {{snappy-0.2.jar}}
> {{snappy-java-1.1.7.3.jar}}
> Repro 1 (Scala):
> $ spark-shell
> {{spark.sql("select 1").write.option("compression", 
> "snappy").mode("overwrite").parquet("tmp/foo")}}
> spark.sql("select 1").write.option("compression", 
> "snappy").mode("overwrite").csv("tmp/foo")
> The first (parquet) will work, the second one won't.
> Repro 2 (PySpark):
>  {{from pyspark.sql import SparkSession}}
>  {{if __name__ == '__main__':}}
>  {{    spark = SparkSession.builder.appName('snappy_testing').getOrCreate()}}
>  {{  spark.sql('select 1').write.option('compression', 
> 'snappy').mode('overwrite').parquet('tmp/works_fine')}}
>  {{  spark.sql('select 1').write.option('compression', 
> 'gzip').mode('overwrite').csv('tmp/also_works')}}
>  {{  spark.sql('select 1').write.option('compression', 
> 'snappy').mode('overwrite').csv('tmp/snappy_not_found')}}
>   
>  In either case I get the following traceback
> java.lang.RuntimeException: native snappy library not available: this version 
> of libhadoop was built without snappy support.java.lang.RuntimeException: 
> native snappy library not available: this version of libhadoop was built 
> without snappy support. at 
> org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:65)
>  at 
> org.apache.hadoop.io.compress.SnappyCodec.getCompressorType(SnappyCodec.java:134)
>  at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:150) 
> at 
> org.apache.hadoop.io.compress.CompressionCodec$Util.createOutputStreamWithCodecPool(CompressionCodec.java:131)
>  at 
> org.apache.hadoop.io.compress.SnappyCodec.createOutputStream(SnappyCodec.java:100)
>  at 
> org.apache.spark.sql.execution.datasources.CodecStreams$$anonfun$createOutputStream$1.apply(CodecStreams.scala:84)
>  at 
> org.apache.spark.sql.execution.datasources.CodecStreams$$anonfun$createOutputStream$1.apply(CodecStreams.scala:84)
>  at scala.Option.map(Option.scala:146) at 
> org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:84)
>  at 
> org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.(CSVFileFormat.scala:177)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:85)
>  at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
>  at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:108)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:236)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.run(Task.scala:123) at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31814) Null in Date conversion from yyMMddHHmmss for specific date and time

2020-05-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119566#comment-17119566
 ] 

Hyukjin Kwon commented on SPARK-31814:
--

Yes, it seems fixed in master.
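
(For reference, a check of this shape can be re-run on master. The full input digits 
are not preserved in this archive, so the 12-digit value below is an assumption for a 
2019-03-31 02:xx timestamp; 02:xx on that date falls into the DST spring-forward gap 
in some session time zones, which may be why the legacy parser returned null.)

{code:scala}
// Assumed repro input: 2019-03-31 02:00:00 written as yyMMddHHmmss
// (the exact digits are an assumption, not taken from the original report).
spark.sql("select to_date('190331020000', 'yyMMddHHmmss')").show(false)
{code}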

> Null in Date conversion from yyMMddHHmmss for specific date and time
> 
>
> Key: SPARK-31814
> URL: https://issues.apache.org/jira/browse/SPARK-31814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Spark Version : 2.3.0.2.6.5.0-292
> Distribution : Hortonworks
>Reporter: Sunny Jain
>Priority: Minor
>
> Hi,
>  
> We are trying to convert a column with string datatype to date type using 
> below example. It seems to work for all timestamps except for the timestamp 
> for 31st March 2019 02:**:** like  19033102. Can you please look into it. 
> Thanks.
> {code}
> scala> sql("select to_date('19033100','yyMMddHHmmss')").show(false)
> ++
> | to_date('19033100', 'yyMMddHHmmss')|
> ++
> |2019-03-31                  |
> ++
> {code}
>  
>  
>  Interestingly below is not working for highlighted hours (02). 
> {code}
> scala> sql("select to_date('19033102','yyMMddHHmmss')").show(false)
> ++
> |   to_date('190331', 'yyMMddHHmmss')|
> ++
> |null                                            |
> ++
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31814) Null in Date conversion from yyMMddHHmmss for specific date and time

2020-05-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31814.
--
Resolution: Cannot Reproduce

> Null in Date conversion from yyMMddHHmmss for specific date and time
> 
>
> Key: SPARK-31814
> URL: https://issues.apache.org/jira/browse/SPARK-31814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Spark Version : 2.3.0.2.6.5.0-292
> Distribution : Hortonworks
>Reporter: Sunny Jain
>Priority: Minor
>
> Hi,
>  
> We are trying to convert a column with string datatype to date type using 
> below example. It seems to work for all timestamps except for the timestamp 
> for 31st March 2019 02:**:** like  19033102. Can you please look into it. 
> Thanks.
> {code}
> scala> sql("select to_date('19033100','yyMMddHHmmss')").show(false)
> ++
> | to_date('19033100', 'yyMMddHHmmss')|
> ++
> |2019-03-31                  |
> ++
> {code}
>  
>  
>  Interestingly below is not working for highlighted hours (02). 
> {code}
> scala> sql("select to_date('19033102','yyMMddHHmmss')").show(false)
> ++
> |   to_date('190331', 'yyMMddHHmmss')|
> ++
> |null                                            |
> ++
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31814) Null in Date conversion from yyMMddHHmmss for specific date and time

2020-05-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31814:
-
Description: 
Hi,

 

We are trying to convert a column with string datatype to date type using below 
example. It seems to work for all timestamps except for the timestamp for 31st 
March 2019 02:**:** like  19033102. Can you please look into it. Thanks.

{code}
scala> sql("select to_date('19033100','yyMMddHHmmss')").show(false)

++
| to_date('19033100', 'yyMMddHHmmss')|
++
|2019-03-31                  |
++
{code}

 

 

 Interestingly below is not working for highlighted hours (02). 

{code}

scala> sql("select to_date('19033102','yyMMddHHmmss')").show(false)

++
|   to_date('190331', 'yyMMddHHmmss')|
++
|null                                            |
++
{code}

 

 

  was:
Hi,

 

We are trying to convert a column with string datatype to date type using below 
example. It seems to work for all timestamps except for the timestamp for 31st 
March 2019 02:**:** like  19033102. Can you please look into it. Thanks.

{code}
scala> sql("select to_date('19033100','yyMMddHHmmss')").show(false)

++
|to_date('19033100', 'yyMMddHHmmss')|
++
|2019-03-31                                                     |
++
{code}

 

 

 Interestingly below is not working for highlighted hours (02). 

{code}

scala> sql("select to_date('19033102','yyMMddHHmmss')").show(false)

++
|to_date('190331', 'yyMMddHHmmss')|
++
|null                                                                   |
++
{code}

 

 


> Null in Date conversion from yyMMddHHmmss for specific date and time
> 
>
> Key: SPARK-31814
> URL: https://issues.apache.org/jira/browse/SPARK-31814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Spark Version : 2.3.0.2.6.5.0-292
> Distribution : Hortonworks
>Reporter: Sunny Jain
>Priority: Minor
>
> Hi,
>  
> We are trying to convert a column with string datatype to date type using 
> below example. It seems to work for all timestamps except for the timestamp 
> for 31st March 2019 02:**:** like  19033102. Can you please look into it. 
> Thanks.
> {code}
> scala> sql("select to_date('19033100','yyMMddHHmmss')").show(false)
> ++
> | to_date('19033100', 'yyMMddHHmmss')|
> ++
> |2019-03-31                  |
> ++
> {code}
>  
>  
>  Interestingly below is not working for highlighted hours (02). 
> {code}
> scala> sql("select to_date('19033102','yyMMddHHmmss')").show(false)
> ++
> |   to_date('190331', 'yyMMddHHmmss')|
> ++
> |null                                            |
> ++
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31814) Null in Date conversion from yyMMddHHmmss for specific date and time

2020-05-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31814:
-
Description: 
Hi,

 

We are trying to convert a column with string datatype to date type using below 
example. It seems to work for all timestamps except for the timestamp for 31st 
March 2019 02:**:** like  19033102. Can you please look into it. Thanks.

{code}
scala> sql("select to_date('19033100','yyMMddHHmmss')").show(false)

++
|to_date('19033100', 'yyMMddHHmmss')|
++
|2019-03-31                                                     |
++
{code}

 

 

 Interestingly below is not working for highlighted hours (02). 

{code}

scala> sql("select to_date('19033102','yyMMddHHmmss')").show(false)

++
|to_date('190331', 'yyMMddHHmmss')|
++
|null                                                                   |
++
{code}

 

 

  was:
Hi,

 

We are trying to convert a column with string datatype to date type using below 
example. It seems to work for all timestamps except for the timestamp for 31st 
March 2019 02:**:** like  19033102. Can you please look into it. Thanks.

scala> sql("select to_date('19033100','yyMMddHHmmss')").show(false)

++
|to_date('19033100', 'yyMMddHHmmss')|

++
|2019-03-31                                                     |

++

 

 

 Interestingly below is not working for highlighted hours (02). 

scala> sql("select to_date('19033102','yyMMddHHmmss')").show(false)

++
|to_date('190331{color:#ff}02{color}', 'yyMMddHHmmss')|

++
|null                                                                   |

++

 

 


> Null in Date conversion from yyMMddHHmmss for specific date and time
> 
>
> Key: SPARK-31814
> URL: https://issues.apache.org/jira/browse/SPARK-31814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Spark Version : 2.3.0.2.6.5.0-292
> Distribution : Hortonworks
>Reporter: Sunny Jain
>Priority: Minor
>
> Hi,
>  
> We are trying to convert a column with string datatype to date type using 
> below example. It seems to work for all timestamps except for the timestamp 
> for 31st March 2019 02:**:** like  19033102. Can you please look into it. 
> Thanks.
> {code}
> scala> sql("select to_date('19033100','yyMMddHHmmss')").show(false)
> ++
> |to_date('19033100', 'yyMMddHHmmss')|
> ++
> |2019-03-31                                                     |
> ++
> {code}
>  
>  
>  Interestingly below is not working for highlighted hours (02). 
> {code}
> scala> sql("select to_date('19033102','yyMMddHHmmss')").show(false)
> ++
> |to_date('190331', 'yyMMddHHmmss')|
> ++
> |null                                                                   |
> ++
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31817) Pass-through of Kerberos credentials from Spark SQL to a jdbc source

2020-05-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119548#comment-17119548
 ] 

Hyukjin Kwon commented on SPARK-31817:
--

cc [~gsomogyi] FYI

> Pass-through of Kerberos credentials from Spark SQL to a jdbc source
> 
>
> Key: SPARK-31817
> URL: https://issues.apache.org/jira/browse/SPARK-31817
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Luis Lozano Coira
>Priority: Major
>
> I am connecting to Spark SQL through the Thrift JDBC/ODBC server using 
> kerberos. From Spark SQL I have connected to a JDBC source using basic 
> authentication but I am interested in doing a pass-through of kerberos 
> credentials to this JDBC source. 
> Would it be possible to do something like that? If not possible, could you 
> consider adding this functionality?
> Anyway I would like to start testing this pass-through and try to develop an 
> approach by myself. How could this functionality be added? Could you give me 
> any indication to start this development?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31836) input_file_name() gives wrong value following Python UDF usage

2020-05-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119546#comment-17119546
 ] 

Hyukjin Kwon commented on SPARK-31836:
--

I believe this is a long-standing bug; SPARK-28153 didn't completely fix it. Can 
you check it in the old versions just to be doubly sure? If this is a regression, 
we should revert SPARK-28153 for now.

> input_file_name() gives wrong value following Python UDF usage
> --
>
> Key: SPARK-31836
> URL: https://issues.apache.org/jira/browse/SPARK-31836
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Wesley Hildebrandt
>Priority: Major
>
> I'm using PySpark for Spark 3.0.0 RC1 with Python 3.6.8.
> The following commands demonstrate that the input_file_name() function 
> sometimes returns the wrong filename following usage of a Python UDF:
> $ for i in `seq 5`; do echo $i > /tmp/test-file-$i; done
> $ pyspark
> >>> import pyspark.sql.functions as F
> >>> spark.readStream.text('file:///tmp/test-file-*', 
> >>> wholetext=True).withColumn('file1', 
> >>> F.input_file_name()).withColumn('udf', F.udf(lambda 
> >>> x:x)('value')).withColumn('file2', 
> >>> F.input_file_name()).writeStream.trigger(once=True).foreachBatch(lambda 
> >>> df,_: df.select('file1','file2').show(truncate=False, 
> >>> vertical=True)).start().awaitTermination()
> A few notes about this bug:
>  * It happens with many different files, so it's not related to the file 
> contents
>  * It also happens loading files from HDFS, so storage location is not a 
> factor
>  * It also happens using .csv() to read the files instead of .text(), so 
> input format is not a factor
>  * I have not been able to cause the error without using readStream, so it 
> seems to be related to streaming
>  * The bug also happens using spark-submit to send a job to my cluster
>  * I haven't tested an older version, but it's possible that the Spark pull 
> requests 24958 and 25321 ([https://github.com/apache/spark/pull/24958], 
> [https://github.com/apache/spark/pull/25321]), which fix issue 28153 
> (https://issues.apache.org/jira/browse/SPARK-28153), introduced this bug?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31836) input_file_name() gives wrong value following Python UDF usage

2020-05-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31836:
-
Component/s: (was: Spark Core)
 Structured Streaming
 SQL

> input_file_name() gives wrong value following Python UDF usage
> --
>
> Key: SPARK-31836
> URL: https://issues.apache.org/jira/browse/SPARK-31836
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Wesley Hildebrandt
>Priority: Major
>
> I'm using PySpark for Spark 3.0.0 RC1 with Python 3.6.8.
> The following commands demonstrate that the input_file_name() function 
> sometimes returns the wrong filename following usage of a Python UDF:
> $ for i in `seq 5`; do echo $i > /tmp/test-file-$i; done
> $ pyspark
> >>> import pyspark.sql.functions as F
> >>> spark.readStream.text('file:///tmp/test-file-*', 
> >>> wholetext=True).withColumn('file1', 
> >>> F.input_file_name()).withColumn('udf', F.udf(lambda 
> >>> x:x)('value')).withColumn('file2', 
> >>> F.input_file_name()).writeStream.trigger(once=True).foreachBatch(lambda 
> >>> df,_: df.select('file1','file2').show(truncate=False, 
> >>> vertical=True)).start().awaitTermination()
> A few notes about this bug:
>  * It happens with many different files, so it's not related to the file 
> contents
>  * It also happens loading files from HDFS, so storage location is not a 
> factor
>  * It also happens using .csv() to read the files instead of .text(), so 
> input format is not a factor
>  * I have not been able to cause the error without using readStream, so it 
> seems to be related to streaming
>  * The bug also happens using spark-submit to send a job to my cluster
>  * I haven't tested an older version, but it's possible that the Spark pull 
> requests 24958 and 25321 ([https://github.com/apache/spark/pull/24958], 
> [https://github.com/apache/spark/pull/25321]), which fix issue 28153 
> (https://issues.apache.org/jira/browse/SPARK-28153), introduced this bug?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31841) Dataset.repartition leverage adaptive execution

2020-05-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31841.
--
Resolution: Duplicate

> Dataset.repartition leverage adaptive execution
> ---
>
> Key: SPARK-31841
> URL: https://issues.apache.org/jira/browse/SPARK-31841
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark branch-3.0 from may 1 this year
>Reporter: koert kuipers
>Priority: Minor
>
> hello,
> we are very happy users of adaptive query execution. It's a great feature to 
> not have to think about and tune the number of partitions in a shuffle 
> anymore.
> I noticed that Dataset.groupBy consistently uses adaptive execution when it's 
> enabled (e.g. I don't see the default 200 partitions), but when I do 
> Dataset.repartition it seems I am back to a hardcoded number of partitions.
> Should adaptive execution also be used for repartition? It would be nice to 
> be able to repartition without having to think about the optimal number of 
> partitions.
> An example:
> {code:java}
> $ spark-shell --conf spark.sql.adaptive.enabled=true --conf 
> spark.sql.adaptive.advisoryPartitionSizeInBytes=10
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_252)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val x = (1 to 100).toDF
> x: org.apache.spark.sql.DataFrame = [value: int]
> scala> x.rdd.getNumPartitions
> res0: Int = 2scala> x.repartition($"value").rdd.getNumPartitions
> res1: Int = 200
> scala> x.groupBy("value").count.rdd.getNumPartitions
> res2: Int = 67
> {code}
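
(In the meantime, a workaround sketch using the explicit-partition-count overload of 
repartition, run in the same spark-shell session as the example above; the partition 
count here is illustrative only.)

{code:scala}
// repartition-by-expression without an explicit count falls back to
// spark.sql.shuffle.partitions (200 by default), so pass the count explicitly:
x.repartition(50, $"value").rdd.getNumPartitions   // 50
{code}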



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26352) Join reordering should not change the order of output attributes

2020-05-29 Thread Cheng Lian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-26352:
---
Summary: Join reordering should not change the order of output attributes  
(was: join reordering should not change the order of output attributes)

> Join reordering should not change the order of output attributes
> 
>
> Key: SPARK-26352
> URL: https://issues.apache.org/jira/browse/SPARK-26352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Major
>  Labels: correctness
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> The optimizer rule {{org.apache.spark.sql.catalyst.optimizer.ReorderJoin}} 
> performs join reordering on inner joins. This was introduced from SPARK-12032 
> in 2015-12.
> After it had reordered the joins, though, it didn't check whether or not the 
> column order (in terms of the {{output}} attribute list) is still the same as 
> before. Thus, it's possible to have a mismatch between the reordered column 
> order vs the schema that a DataFrame thinks it has.
> This can be demonstrated with the example:
> {code:none}
> spark.sql("create table table_a (x int, y int) using parquet")
> spark.sql("create table table_b (i int, j int) using parquet")
> spark.sql("create table table_c (a int, b int) using parquet")
> val df = spark.sql("with df1 as (select * from table_a cross join table_b) 
> select * from df1 join table_c on a = x and b = i")
> {code}
> here's what the DataFrame thinks:
> {code:none}
> scala> df.printSchema
> root
>  |-- x: integer (nullable = true)
>  |-- y: integer (nullable = true)
>  |-- i: integer (nullable = true)
>  |-- j: integer (nullable = true)
>  |-- a: integer (nullable = true)
>  |-- b: integer (nullable = true)
> {code}
> here's what the optimized plan thinks, after join reordering:
> {code:none}
> scala> df.queryExecution.optimizedPlan.output.foreach(a => println(s"|-- 
> ${a.name}: ${a.dataType.typeName}"))
> |-- x: integer
> |-- y: integer
> |-- a: integer
> |-- b: integer
> |-- i: integer
> |-- j: integer
> {code}
> If we exclude the {{ReorderJoin}} rule (using Spark 2.4's optimizer rule 
> exclusion feature), it's back to normal:
> {code:none}
> scala> spark.conf.set("spark.sql.optimizer.excludedRules", 
> "org.apache.spark.sql.catalyst.optimizer.ReorderJoin")
> scala> val df = spark.sql("with df1 as (select * from table_a cross join 
> table_b) select * from df1 join table_c on a = x and b = i")
> df: org.apache.spark.sql.DataFrame = [x: int, y: int ... 4 more fields]
> scala> df.queryExecution.optimizedPlan.output.foreach(a => println(s"|-- 
> ${a.name}: ${a.dataType.typeName}"))
> |-- x: integer
> |-- y: integer
> |-- i: integer
> |-- j: integer
> |-- a: integer
> |-- b: integer
> {code}
> Note that this column ordering problem leads to data corruption, and can 
> manifest itself in various symptoms:
> * Silently corrupting data, if the reordered columns happen to either have 
> matching types or have sufficiently-compatible types (e.g. all fixed length 
> primitive types are considered as "sufficiently compatible" in an UnsafeRow), 
> then only the resulting data is going to be wrong but it might not trigger 
> any alarms immediately. Or
> * Weird Java-level exceptions like {{java.lang.NegativeArraySizeException}}, 
> or even SIGSEGVs.
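
(For anyone wanting to check whether a given query is affected, a small sanity-check 
sketch against the DataFrame from the example above:)

{code:scala}
// The analyzed schema and the optimized plan should expose the same attribute order;
// a mismatch is the symptom described in this issue.
val schemaNames = df.schema.fieldNames.toSeq
val planNames   = df.queryExecution.optimizedPlan.output.map(_.name)
assert(schemaNames == planNames, s"column order mismatch: $schemaNames vs $planNames")
{code}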



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31857) Support Azure SQLDB Kerberos login in JDBC connector

2020-05-29 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119488#comment-17119488
 ] 

Gabor Somogyi commented on SPARK-31857:
---

I think it's better to add an API where external database connectors can be 
plugged in, instead of adding support for individual cloud-provider databases.
WDYT [~smilegator]?
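
(Purely as an illustration of that idea, something along these lines; the names and 
signatures below are hypothetical, not an existing Spark API.)

{code:scala}
import java.sql.{Connection, Driver}

// Hypothetical plug-in point: an external module could register a provider for its
// database, instead of Spark hard-coding support for each cloud vendor.
trait JdbcConnectionProvider {
  /** Whether this provider knows how to authenticate against the given driver/options. */
  def canHandle(driver: Driver, options: Map[String, String]): Boolean

  /** Open a connection, performing any Kerberos/token handling the database needs. */
  def getConnection(driver: Driver, options: Map[String, String]): Connection
}
{code}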


> Support Azure SQLDB Kerberos login in JDBC connector
> 
>
> Key: SPARK-31857
> URL: https://issues.apache.org/jira/browse/SPARK-31857
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31868) Wrong results for week-based-year

2020-05-29 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-31868:
-
Description: 
{code:sql}
spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd')
 > ;
== Physical Plan ==
*(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#37]
+- *(1) Scan OneRowRelation[]


spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy   legacy
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#53]
+- *(1) Scan OneRowRelation[]


spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#69]
+- *(1) Scan OneRowRelation[]


spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1970-01-01 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy   legacy
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#87]
+- *(1) Scan OneRowRelation[]


spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1968-12-29 00:00:00
spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1969-01-01 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1969-01-01 00:00:00
{code}


{code:java}

== Physical Plan ==
*(1) Project [-288 AS to_timestamp(2000-01-01, -MM-dd)#73]
+- *(1) Scan OneRowRelation[]

spark-sql> select to_timestamp('2000-01-01', '-MM-dd');
1970-01-01 00:00:00
{code}


Seems that `-288` is static



  was:
{code:sql}
spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd')
 > ;
== Physical Plan ==
*(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#37]
+- *(1) Scan OneRowRelation[]


spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy   legacy
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#53]
+- *(1) Scan OneRowRelation[]


spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#69]
+- *(1) Scan OneRowRelation[]


spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1970-01-01 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy   legacy
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#87]
+- *(1) Scan OneRowRelation[]


spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1968-12-29 00:00:00
spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1969-01-01 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1969-01-01 00:00:00
{code}





> Wrong results for week-based-year 
> --
>
> Key: SPARK-31868
> URL: https://issues.apache.org/jira/browse/SPARK-31868
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Blocker
>  Labels: correctness
>
> {code:sql}
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd')
>  > ;
> == Physical Plan ==
> *(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#37]
> +- *(1) Scan OneRowRelation[]
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
> == Physical Plan ==
> *(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#53]
> +- *(1) Scan OneRowRelation[]
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> explain select 

[jira] [Updated] (SPARK-31868) Wrong results for week-based-year

2020-05-29 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-31868:
-
Description: 
{code:sql}
spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd')
 > ;
== Physical Plan ==
*(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#37]
+- *(1) Scan OneRowRelation[]


spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy   legacy
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#53]
+- *(1) Scan OneRowRelation[]


spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#69]
+- *(1) Scan OneRowRelation[]


spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1970-01-01 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy   legacy
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#87]
+- *(1) Scan OneRowRelation[]


spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1968-12-29 00:00:00
spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1969-01-01 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1969-01-01 00:00:00
{code}




  was:
{code:sql}
spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd')
 > ;
== Physical Plan ==
*(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#37]
+- *(1) Scan OneRowRelation[]


spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy   legacy
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#53]
+- *(1) Scan OneRowRelation[]


spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#69]
+- *(1) Scan OneRowRelation[]


spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1970-01-01 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy   legacy
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#87]
+- *(1) Scan OneRowRelation[]


spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1968-12-29 00:00:00
spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1969-01-01 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1969-01-01 00:00:00
{code}


{code:java}
spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> explain select to_timestamp(date '1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-315648 AS to_timestamp(DATE '1969-01-01', 
-MM-dd)#121]
+- *(1) Scan OneRowRelation[]
{code}




> Wrong results for week-based-year 
> --
>
> Key: SPARK-31868
> URL: https://issues.apache.org/jira/browse/SPARK-31868
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Blocker
>  Labels: correctness
>
> {code:sql}
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd')
>  > ;
> == Physical Plan ==
> *(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#37]
> +- *(1) Scan OneRowRelation[]
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
> == Physical Plan ==
> *(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#53]
> +- *(1) Scan OneRowRelation[]
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> 

[jira] [Updated] (SPARK-31868) Wrong results for week-based-year

2020-05-29 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-31868:
-
Description: 
{code:sql}
spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd')
 > ;
== Physical Plan ==
*(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#37]
+- *(1) Scan OneRowRelation[]


spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy   legacy
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#53]
+- *(1) Scan OneRowRelation[]


spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#69]
+- *(1) Scan OneRowRelation[]


spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1970-01-01 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy   legacy
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#87]
+- *(1) Scan OneRowRelation[]


spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1968-12-29 00:00:00
spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1969-01-01 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1969-01-01 00:00:00
{code}


{code:java}
spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> explain select to_timestamp(date '1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-315648 AS to_timestamp(DATE '1969-01-01', 
-MM-dd)#121]
+- *(1) Scan OneRowRelation[]
{code}



  was:

{code:sql}
spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd')
 > ;
== Physical Plan ==
*(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#37]
+- *(1) Scan OneRowRelation[]


spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy   legacy
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#53]
+- *(1) Scan OneRowRelation[]


spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#69]
+- *(1) Scan OneRowRelation[]


spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1970-01-01 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy   legacy
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#87]
+- *(1) Scan OneRowRelation[]


spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1968-12-29 00:00:00
spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1969-01-01 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1969-01-01 00:00:00
{code}



> Wrong results for week-based-year 
> --
>
> Key: SPARK-31868
> URL: https://issues.apache.org/jira/browse/SPARK-31868
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Blocker
>  Labels: correctness
>
> {code:sql}
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> spark.sql.legacy.timeParserPolicy exception
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd')
>  > ;
> == Physical Plan ==
> *(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#37]
> +- *(1) Scan OneRowRelation[]
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
> == Physical Plan ==
> *(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#53]
> +- *(1) Scan OneRowRelation[]
> spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
> 

[jira] [Created] (SPARK-31868) Wrong results for week-based-year

2020-05-29 Thread Kent Yao (Jira)
Kent Yao created SPARK-31868:


 Summary: Wrong results for week-based-year 
 Key: SPARK-31868
 URL: https://issues.apache.org/jira/browse/SPARK-31868
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Kent Yao



{code:sql}
spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd')
 > ;
== Physical Plan ==
*(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#37]
+- *(1) Scan OneRowRelation[]


spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy   legacy
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#53]
+- *(1) Scan OneRowRelation[]


spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-288 AS to_timestamp(1969-01-01, -MM-dd)#69]
+- *(1) Scan OneRowRelation[]


spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1970-01-01 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy   legacy
spark-sql> explain select to_timestamp('1969-01-01', '-MM-dd');
== Physical Plan ==
*(1) Project [-318240 AS to_timestamp(1969-01-01, -MM-dd)#87]
+- *(1) Scan OneRowRelation[]


spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1968-12-29 00:00:00
spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1969-01-01 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy   exception
spark-sql> select to_timestamp('1969-01-01', '-MM-dd');
1969-01-01 00:00:00
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31867) Fix silent data change for datetime formatting

2020-05-29 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-31867:
-
Description: 
{code:java}
spark-sql> select from_unixtime(1, 'yyy-MM-dd');
NULL
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy   legacy
spark-sql> select from_unixtime(1, 'yyy-MM-dd');
0001970-01-01
spark-sql>
{code}
For patterns that support `SignStyle.EXCEEDS_PAD`, e.g. `y..y`(len >=4), when 
using the `NumberPrinterParser` to format it

{code:java}
switch (signStyle) {
  case EXCEEDS_PAD:
if (minWidth < 19 && value >= EXCEED_POINTS[minWidth]) {
  buf.append(decimalStyle.getPositiveSign());
}
break;
   
   
{code}
the `minWidth` == `len(y..y)`
the `EXCEED_POINTS` is 

{code:java}
/**
 * Array of 10 to the power of n.
 */
static final long[] EXCEED_POINTS = new long[] {
0L,
10L,
100L,
1000L,
10000L,
100000L,
1000000L,
10000000L,
100000000L,
1000000000L,
10000000000L,
};
{code}

So when `len(y..y)` is greater than 10, an `ArrayIndexOutOfBoundsException` 
will be raised.

At the caller side, for `from_unixtime` the exception is suppressed and a silent 
data change occurs; for `date_format`, the `ArrayIndexOutOfBoundsException` 
propagates.



  was:
{code:java}
spark-sql> select from_unixtime(1, 'yyy-MM-dd');
NULL
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy   legacy
spark-sql> select from_unixtime(1, 'yyy-MM-dd');
0001970-01-01
spark-sql>

For patterns that support `SignStyle.EXCEEDS_PAD`, e.g. `y..y`(len >=4), when 
using the `NumberPrinterParser` to format it

```java
switch (signStyle) {
  case EXCEEDS_PAD:
if (minWidth < 19 && value >= EXCEED_POINTS[minWidth]) {
  buf.append(decimalStyle.getPositiveSign());
}
break;
   
   
``` 
the `minWidth` == `len(y..y)`
the `EXCEED_POINTS` is 

```java
/**
 * Array of 10 to the power of n.
 */
static final long[] EXCEED_POINTS = new long[] {
0L,
10L,
100L,
1000L,
10000L,
100000L,
1000000L,
10000000L,
100000000L,
1000000000L,
10000000000L,
};
```

So when `len(y..y)` is greater than 10, an `ArrayIndexOutOfBoundsException` 
will be raised.

On the caller side, `from_unixtime` suppresses the exception and a silent data 
change occurs, while `date_format` lets the `ArrayIndexOutOfBoundsException` propagate.
{code}



> Fix silent data change for datetime formatting 
> ---
>
> Key: SPARK-31867
> URL: https://issues.apache.org/jira/browse/SPARK-31867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Blocker
>
> {code:java}
> spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
> NULL
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
> 00000001970-01-01
> spark-sql>
> {code}
> For patterns that support `SignStyle.EXCEEDS_PAD`, e.g. `y..y`(len >=4), when 
> using the `NumberPrinterParser` to format it
> {code:java}
> switch (signStyle) {
>   case EXCEEDS_PAD:
> if (minWidth < 19 && value >= EXCEED_POINTS[minWidth]) {
>   buf.append(decimalStyle.getPositiveSign());
> }
> break;
>
>
> {code}
> the `minWidth` == `len(y..y)`
> the `EXCEED_POINTS` is 
> {code:java}
> /**
>  * Array of 10 to the power of n.
>  */
> static final long[] EXCEED_POINTS = new long[] {
> 0L,
> 10L,
> 100L,
> 1000L,
> 10000L,
> 100000L,
> 1000000L,
> 10000000L,
> 100000000L,
> 1000000000L,
> 10000000000L,
> };
> {code}
> So when `len(y..y)` is greater than 10, an `ArrayIndexOutOfBoundsException` 
> will be raised.
> On the caller side, `from_unixtime` suppresses the exception and a silent data 
> change occurs, while `date_format` lets the `ArrayIndexOutOfBoundsException` propagate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31867) Fix silent data change for datetime formatting

2020-05-29 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-31867:
-
Description: 
{code:java}
spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
NULL
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy   legacy
spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
00000001970-01-01
spark-sql>

For patterns that support `SignStyle.EXCEEDS_PAD`, e.g. `y..y`(len >=4), when 
using the `NumberPrinterParser` to format it

```java
switch (signStyle) {
  case EXCEEDS_PAD:
if (minWidth < 19 && value >= EXCEED_POINTS[minWidth]) {
  buf.append(decimalStyle.getPositiveSign());
}
break;
   
   
``` 
the `minWidth` == `len(y..y)`
the `EXCEED_POINTS` is 

```java
/**
 * Array of 10 to the power of n.
 */
static final long[] EXCEED_POINTS = new long[] {
0L,
10L,
100L,
1000L,
10000L,
100000L,
1000000L,
10000000L,
100000000L,
1000000000L,
10000000000L,
};
```

So when `len(y..y)` is greater than 10, an `ArrayIndexOutOfBoundsException` 
will be raised.

On the caller side, `from_unixtime` suppresses the exception and a silent data 
change occurs, while `date_format` lets the `ArrayIndexOutOfBoundsException` propagate.
{code}


  was:

{code:java}
spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
NULL
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy   legacy
spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
00000001970-01-01
spark-sql>
{code}



> Fix silent data change for datetime formatting 
> ---
>
> Key: SPARK-31867
> URL: https://issues.apache.org/jira/browse/SPARK-31867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Blocker
>
> {code:java}
> spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
> NULL
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
> 00000001970-01-01
> spark-sql>
> For patterns that support `SignStyle.EXCEEDS_PAD`, e.g. `y..y`(len >=4), when 
> using the `NumberPrinterParser` to format it
> ```java
> switch (signStyle) {
>   case EXCEEDS_PAD:
> if (minWidth < 19 && value >= EXCEED_POINTS[minWidth]) {
>   buf.append(decimalStyle.getPositiveSign());
> }
> break;
>
>
> ``` 
> the `minWidth` == `len(y..y)`
> the `EXCEED_POINTS` is 
> ```java
> /**
>  * Array of 10 to the power of n.
>  */
> static final long[] EXCEED_POINTS = new long[] {
> 0L,
> 10L,
> 100L,
> 1000L,
> 10000L,
> 100000L,
> 1000000L,
> 10000000L,
> 100000000L,
> 1000000000L,
> 10000000000L,
> };
> ```
> So when `len(y..y)` is greater than 10, an `ArrayIndexOutOfBoundsException` 
> will be raised.
> On the caller side, `from_unixtime` suppresses the exception and a silent data 
> change occurs, while `date_format` lets the `ArrayIndexOutOfBoundsException` propagate.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31867) Fix silent data change for datetime formatting

2020-05-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31867:


Assignee: Apache Spark

> Fix silent data change for datetime formatting 
> ---
>
> Key: SPARK-31867
> URL: https://issues.apache.org/jira/browse/SPARK-31867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Blocker
>
> {code:java}
> spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
> NULL
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
> 00000001970-01-01
> spark-sql>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31867) Fix silent data change for datetime formatting

2020-05-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31867:


Assignee: (was: Apache Spark)

> Fix silent data change for datetime formatting 
> ---
>
> Key: SPARK-31867
> URL: https://issues.apache.org/jira/browse/SPARK-31867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Blocker
>
> {code:java}
> spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
> NULL
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
> 00000001970-01-01
> spark-sql>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31867) Fix silent data change for datetime formatting

2020-05-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119389#comment-17119389
 ] 

Apache Spark commented on SPARK-31867:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/28673

> Fix silent data change for datetime formatting 
> ---
>
> Key: SPARK-31867
> URL: https://issues.apache.org/jira/browse/SPARK-31867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Blocker
>
> {code:java}
> spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
> NULL
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
> 00000001970-01-01
> spark-sql>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31867) Fix silent data change for datetime formatting

2020-05-29 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-31867:
-
Priority: Blocker  (was: Major)

> Fix silent data change for datetime formatting 
> ---
>
> Key: SPARK-31867
> URL: https://issues.apache.org/jira/browse/SPARK-31867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Blocker
>
> {code:java}
> spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
> NULL
> spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
> spark.sql.legacy.timeParserPolicy legacy
> spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
> 00000001970-01-01
> spark-sql>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31867) Fix silent data change for datetime formatting

2020-05-29 Thread Kent Yao (Jira)
Kent Yao created SPARK-31867:


 Summary: Fix silent data change for datetime formatting 
 Key: SPARK-31867
 URL: https://issues.apache.org/jira/browse/SPARK-31867
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Kent Yao



{code:java}
spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
NULL
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy   legacy
spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
00000001970-01-01
spark-sql>
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31866) Add partitioning hints in SQL reference

2020-05-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119355#comment-17119355
 ] 

Apache Spark commented on SPARK-31866:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/28672

> Add partitioning hints in SQL reference
> ---
>
> Key: SPARK-31866
> URL: https://issues.apache.org/jira/browse/SPARK-31866
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add the Coalesce/Repartition/Repartition_By_Range partitioning hints to the SQL 
> reference.
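
For context, a minimal sketch of what the three hints look like in practice (Spark 3.0 SQL syntax; the temp view name `t` is only an illustration):

{code:scala}
// Register a small table to demonstrate the partitioning hints.
spark.range(0, 100).createOrReplaceTempView("t")

// COALESCE reduces the number of partitions without a full shuffle.
spark.sql("SELECT /*+ COALESCE(3) */ * FROM t").explain()

// REPARTITION shuffles the data into the requested number of partitions.
spark.sql("SELECT /*+ REPARTITION(5) */ * FROM t").explain()

// REPARTITION_BY_RANGE range-partitions the data by the given column(s).
spark.sql("SELECT /*+ REPARTITION_BY_RANGE(4, id) */ * FROM t").explain()
{code}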



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31866) Add partitioning hints in SQL reference

2020-05-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31866:


Assignee: Apache Spark

> Add partitioning hints in SQL reference
> ---
>
> Key: SPARK-31866
> URL: https://issues.apache.org/jira/browse/SPARK-31866
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>
> Add the Coalesce/Repartition/Repartition_By_Range partitioning hints to the SQL 
> reference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31866) Add partitioning hints in SQL reference

2020-05-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31866:


Assignee: (was: Apache Spark)

> Add partitioning hints in SQL reference
> ---
>
> Key: SPARK-31866
> URL: https://issues.apache.org/jira/browse/SPARK-31866
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add the Coalesce/Repartition/Repartition_By_Range partitioning hints to the SQL 
> reference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28481) More expressions should extend NullIntolerant

2020-05-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-28481:
---

Assignee: Yuming Wang

> More expressions should extend NullIntolerant
> -
>
> Key: SPARK-28481
> URL: https://issues.apache.org/jira/browse/SPARK-28481
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Josh Rosen
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> SPARK-13995 introduced the {{NullIntolerant}} trait to generalize the logic 
> for inferring {{IsNotNull}} constraints from expressions. An expression is 
> _null-intolerant_ if it returns {{null}} when any of its input expressions 
> are {{null}}.
> I've noticed that _most_ expressions are null-intolerant: anything which 
> extends UnaryExpression / BinaryExpression and keeps the default {{eval}} 
> method will be null-intolerant. However, only a subset of these expressions 
> mix in the {{NullIntolerant}} trait. As a result, we're missing out on the 
> opportunity to infer certain types of non-null constraints: for example, if 
> we see a {{WHERE length\(x\) > 10}} condition then we know that the column 
> {{x}} must be non-null and can push this non-null filter down to our 
> datasource scan.
> I can think of a few ways to fix this:
>  # Modify every relevant expression to mix in the {{NullIntolerant}} trait. 
> We can use IDEs or other code-analysis tools (e.g. {{ClassUtil}} plus 
> reflection) to help automate the process of identifying expressions which do 
> not override the default {{eval}}.
>  # Make a backwards-incompatible change to our abstract base class hierarchy 
> to add {{NullSafe*aryExpression}} abstract base classes which define the 
> {{nullSafeEval}} method and implement a {{final eval}} method, then leave 
> {{eval}} unimplemented in the regular {{*aryExpression}} base classes.
>  ** This would fix the somewhat weird code smell that we have today where 
> {{nullSafeEval}} has a default implementation which calls {{sys.error}}.
>  ** This would negatively impact users who have implemented custom Catalyst 
> expressions.
>  # Use runtime reflection to determine whether expressions are 
> null-intolerant by virtue of using one of the default null-intolerant 
> {{eval}} implementations. We can then use this in an {{isNullIntolerant}} 
> helper method which checks that classes either (a) extend {{NullIntolerant}} 
> or (b) are null-intolerant according to the reflective check (which is 
> basically just figuring out which concrete implementation the {{eval}} method 
> resolves to).
>  ** We only need to perform the reflection once _per-class_ and can cache the 
> result for the lifetime of the JVM, so the performance overheads would be 
> pretty small (especially compared to other non-cacheable reflection / 
> traversal costs in Catalyst).
>  ** The downside is additional complexity in the code which pattern-matches / 
> checks for null-intolerance.
> Of these approaches, I'm currently leaning towards option 1 (semi-automated 
> identification and manual update of hundreds of expressions): if we go with 
> that approach then we can perform a one-time catch-up to fix all existing 
> expressions. To handle ongoing maintenance (as we add new expressions), I'd 
> propose to add "is this null-intolerant?" to a checklist to use when 
> reviewing PRs which add new Catalyst expressions. 
> /cc [~maropu] [~viirya]
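
To make option 3 concrete, here is a simplified, Spark-independent sketch of the cached reflective check. The class and method names below are stand-ins for Catalyst's, not the real API.

{code:scala}
import scala.collection.concurrent.TrieMap

// Simplified stand-ins for Catalyst's classes, only to illustrate the idea.
trait NullIntolerant
abstract class Expression {
  def eval(input: Any): Any
}
abstract class UnaryExpression extends Expression {
  // The default eval is null-intolerant: a null input yields a null output.
  override def eval(input: Any): Any =
    if (input == null) null else nullSafeEval(input)
  protected def nullSafeEval(input: Any): Any
}

object NullIntolerance {
  // Reflection result cached per class for the lifetime of the JVM.
  private val cache = TrieMap.empty[Class[_], Boolean]

  def isNullIntolerant(e: Expression): Boolean =
    cache.getOrElseUpdate(e.getClass, {
      val cls = e.getClass
      // (a) the trait is mixed in, or (b) eval still resolves to the default
      // implementation declared by UnaryExpression (never overridden).
      classOf[NullIntolerant].isAssignableFrom(cls) ||
        cls.getMethod("eval", classOf[Object]).getDeclaringClass == classOf[UnaryExpression]
    })
}

// A toy expression that keeps the default eval, so the check returns true.
case class StringLength() extends UnaryExpression {
  override protected def nullSafeEval(input: Any): Any = input.toString.length
}

// NullIntolerance.isNullIntolerant(StringLength())  // true
{code}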



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28481) More expressions should extend NullIntolerant

2020-05-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28481.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28626
[https://github.com/apache/spark/pull/28626]

> More expressions should extend NullIntolerant
> -
>
> Key: SPARK-28481
> URL: https://issues.apache.org/jira/browse/SPARK-28481
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Josh Rosen
>Priority: Major
> Fix For: 3.1.0
>
>
> SPARK-13995 introduced the {{NullIntolerant}} trait to generalize the logic 
> for inferring {{IsNotNull}} constraints from expressions. An expression is 
> _null-intolerant_ if it returns {{null}} when any of its input expressions 
> are {{null}}.
> I've noticed that _most_ expressions are null-intolerant: anything which 
> extends UnaryExpression / BinaryExpression and keeps the default {{eval}} 
> method will be null-intolerant. However, only a subset of these expressions 
> mix in the {{NullIntolerant}} trait. As a result, we're missing out on the 
> opportunity to infer certain types of non-null constraints: for example, if 
> we see a {{WHERE length\(x\) > 10}} condition then we know that the column 
> {{x}} must be non-null and can push this non-null filter down to our 
> datasource scan.
> I can think of a few ways to fix this:
>  # Modify every relevant expression to mix in the {{NullIntolerant}} trait. 
> We can use IDEs or other code-analysis tools (e.g. {{ClassUtil}} plus 
> reflection) to help automate the process of identifying expressions which do 
> not override the default {{eval}}.
>  # Make a backwards-incompatible change to our abstract base class hierarchy 
> to add {{NullSafe*aryExpression}} abstract base classes which define the 
> {{nullSafeEval}} method and implement a {{final eval}} method, then leave 
> {{eval}} unimplemented in the regular {{*aryExpression}} base classes.
>  ** This would fix the somewhat weird code smell that we have today where 
> {{nullSafeEval}} has a default implementation which calls {{sys.error}}.
>  ** This would negatively impact users who have implemented custom Catalyst 
> expressions.
>  # Use runtime reflection to determine whether expressions are 
> null-intolerant by virtue of using one of the default null-intolerant 
> {{eval}} implementations. We can then use this in an {{isNullIntolerant}} 
> helper method which checks that classes either (a) extend {{NullIntolerant}} 
> or (b) are null-intolerant according to the reflective check (which is 
> basically just figuring out which concrete implementation the {{eval}} method 
> resolves to).
>  ** We only need to perform the reflection once _per-class_ and can cache the 
> result for the lifetime of the JVM, so the performance overheads would be 
> pretty small (especially compared to other non-cacheable reflection / 
> traversal costs in Catalyst).
>  ** The downside is additional complexity in the code which pattern-matches / 
> checks for null-intolerance.
> Of these approaches, I'm currently leaning towards option 1 (semi-automated 
> identification and manual update of hundreds of expressions): if we go with 
> that approach then we can perform a one-time catch-up to fix all existing 
> expressions. To handle ongoing maintenance (as we add new expressions), I'd 
> propose to add "is this null-intolerant?" to a checklist to use when 
> reviewing PRs which add new Catalyst expressions. 
> /cc [~maropu] [~viirya]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31854) Different results of query execution with wholestage codegen on and off

2020-05-29 Thread Paul Finkelshteyn (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119352#comment-17119352
 ] 

Paul Finkelshteyn commented on SPARK-31854:
---

What do you mean by query? It is written in Kotlin and is at the top of the report.

If you need an alternative query in Scala, it looks like the following:


{code:java}
spark.conf.set("spark.sql.codegen.wholeStage", false)

Seq(1.asInstanceOf[Integer], null.asInstanceOf[Integer], 3.asInstanceOf[Integer])
  .toDS()
  .map(v => (v, v))
  .show()
{code}

It also works when spark.sql.codegen.wholeStage is false and fails when it is on.


> Different results of query execution with wholestage codegen on and off
> ---
>
> Key: SPARK-31854
> URL: https://issues.apache.org/jira/browse/SPARK-31854
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Paul Finkelshteyn
>Priority: Major
>
> Preface: I'm creating a Kotlin API for Spark to take the best parts from three 
> worlds: Spark Scala, Spark Java, and Kotlin.
> The nice part is that it works in most scenarios.
> But I've hit the following corner case:
> {code:scala}
> withSpark(props = mapOf("spark.sql.codegen.wholeStage" to true)) {
> dsOf(1, null, 2)
> .map { c(it) }
> .debugCodegen()
> .show()
> }
> {code}
> c(it) is creation of unnamed tuple
> It fails with exception
> {code}
> java.lang.NullPointerException: Null value appeared in non-nullable field:
> top level Product or row object
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> …
> {code}
> I know it won't work in Scala, so I could stop here. But it works in Kotlin 
> if I turn wholestage codegen off!
> Moreover, if we dig into the generated code (when wholestage codegen is on), 
> we'll see that it basically boils down to this:
> if one of the elements in the source dataset is null, an NPE is thrown no matter what.
> The flow is as follows:
> {code}
> private void serializefromobject_doConsume_0(org.jetbrains.spark.api.Arity1 
> serializefromobject_expr_0_0, boolean serializefromobject_exprIsNull_0_0) 
> throws java.io.IOException {
> serializefromobject_doConsume_0(mapelements_value_1, 
> mapelements_isNull_1);
> mapelements_isNull_1 = mapelements_resultIsNull_0;
> mapelements_resultIsNull_0 = mapelements_exprIsNull_0_0;
> private void mapelements_doConsume_0(java.lang.Integer 
> mapelements_expr_0_0, boolean mapelements_exprIsNull_0_0) throws 
> java.io.IOException {
> mapelements_doConsume_0(deserializetoobject_value_0, 
> deserializetoobject_isNull_0);
> deserializetoobject_resultIsNull_0 = 
> deserializetoobject_exprIsNull_0_0;
> private void 
> deserializetoobject_doConsume_0(InternalRow localtablescan_row_0, int 
> deserializetoobject_expr_0_0, boolean deserializetoobject_exprIsNull_0_0) 
> throws java.io.IOException {
> 
> deserializetoobject_doConsume_0(localtablescan_row_0, localtablescan_value_0, 
> localtablescan_isNull_0);
> boolean localtablescan_isNull_0 = 
> localtablescan_row_0.isNullAt(0);
> mapelements_isNull_1 = true;
> {code}
> You can find the generated code, both in its original form and in a slightly 
> simplified and refactored version, 
> [here|https://gist.github.com/asm0dey/5c0fa4c985ab999b383d16257b515100].
> I believe that Spark should not behave differently with wholestage codegen on and 
> off; the difference in behavior looks like a bug.
> My Spark version is 3.0.0-preview2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31866) Add partitioning hints in SQL reference

2020-05-29 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-31866:
--

 Summary: Add partitioning hints in SQL reference
 Key: SPARK-31866
 URL: https://issues.apache.org/jira/browse/SPARK-31866
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Affects Versions: 3.0.0
Reporter: Huaxin Gao


Add the Coalesce/Repartition/Repartition_By_Range partitioning hints to the SQL 
reference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31854) Different results of query execution with wholestage codegen on and off

2020-05-29 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119334#comment-17119334
 ] 

Takeshi Yamamuro commented on SPARK-31854:
--

Could you show us a query to reproduce the issue in our env? If we don't have 
it, we cannot look into it. Anyway, thanks for the report.

> Different results of query execution with wholestage codegen on and off
> ---
>
> Key: SPARK-31854
> URL: https://issues.apache.org/jira/browse/SPARK-31854
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Paul Finkelshteyn
>Priority: Major
>
> Preface: I'm creating a Kotlin API for Spark to take the best parts from three 
> worlds: Spark Scala, Spark Java, and Kotlin.
> The nice part is that it works in most scenarios.
> But I've hit the following corner case:
> {code:scala}
> withSpark(props = mapOf("spark.sql.codegen.wholeStage" to true)) {
> dsOf(1, null, 2)
> .map { c(it) }
> .debugCodegen()
> .show()
> }
> {code}
> c(it) is creation of unnamed tuple
> It fails with exception
> {code}
> java.lang.NullPointerException: Null value appeared in non-nullable field:
> top level Product or row object
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> …
> {code}
> I know it won't work in Scala, so I could stop here. But it works in Kotlin 
> if I turn wholestage codegen off!
> Moreover, if we dig into the generated code (when wholestage codegen is on), 
> we'll see that it basically boils down to this:
> if one of the elements in the source dataset is null, an NPE is thrown no matter what.
> The flow is as follows:
> {code}
> private void serializefromobject_doConsume_0(org.jetbrains.spark.api.Arity1 
> serializefromobject_expr_0_0, boolean serializefromobject_exprIsNull_0_0) 
> throws java.io.IOException {
> serializefromobject_doConsume_0(mapelements_value_1, 
> mapelements_isNull_1);
> mapelements_isNull_1 = mapelements_resultIsNull_0;
> mapelements_resultIsNull_0 = mapelements_exprIsNull_0_0;
> private void mapelements_doConsume_0(java.lang.Integer 
> mapelements_expr_0_0, boolean mapelements_exprIsNull_0_0) throws 
> java.io.IOException {
> mapelements_doConsume_0(deserializetoobject_value_0, 
> deserializetoobject_isNull_0);
> deserializetoobject_resultIsNull_0 = 
> deserializetoobject_exprIsNull_0_0;
> private void 
> deserializetoobject_doConsume_0(InternalRow localtablescan_row_0, int 
> deserializetoobject_expr_0_0, boolean deserializetoobject_exprIsNull_0_0) 
> throws java.io.IOException {
> 
> deserializetoobject_doConsume_0(localtablescan_row_0, localtablescan_value_0, 
> localtablescan_isNull_0);
> boolean localtablescan_isNull_0 = 
> localtablescan_row_0.isNullAt(0);
> mapelements_isNull_1 = true;
> {code}
> You can find the generated code, both in its original form and in a slightly 
> simplified and refactored version, 
> [here|https://gist.github.com/asm0dey/5c0fa4c985ab999b383d16257b515100].
> I believe that Spark should not behave differently with wholestage codegen on and 
> off; the difference in behavior looks like a bug.
> My Spark version is 3.0.0-preview2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31779) Redefining struct inside array incorrectly wraps child fields in array

2020-05-29 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119319#comment-17119319
 ] 

Pablo Langa Blanco commented on SPARK-31779:


How about using arrays_zip?
{code:java}
val newTop = struct(
  df("top").getField("child1").alias("child1"),
  arrays_zip(
    df("top").getField("child2").getField("child2Second"),
    df("top").getField("child2").getField("child2Third")).alias("child2"))
{code}

> Redefining struct inside array incorrectly wraps child fields in array
> --
>
> Key: SPARK-31779
> URL: https://issues.apache.org/jira/browse/SPARK-31779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Jeff Evans
>Priority: Major
>
> It seems that redefining a {{struct}} for the purpose of removing a 
> sub-field, when that {{struct}} is itself inside an {{array}}, results in the 
> remaining (non-removed) {{struct}} fields themselves being incorrectly 
> wrapped in an array.
> For more context, see [this|https://stackoverflow.com/a/46084983/375670] 
> StackOverflow answer and discussion thread.  I have debugged this code and 
> distilled it down to what I believe represents a bug in Spark itself.
> Consider the following {{spark-shell}} session (version 2.4.5):
> {code}
> // use a nested JSON structure that contains a struct inside an array
> val jsonData = """{
>   "foo": "bar",
>   "top": {
> "child1": 5,
> "child2": [
>   {
> "child2First": "one",
> "child2Second": 2
>   }
> ]
>   }
> }"""
> // read into a DataFrame
> val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS())
> // create a new definition for "top", which will remove the 
> "top.child2.child2First" column
> val newTop = struct(df("top").getField("child1").alias("child1"), 
> array(struct(df("top").getField("child2").getField("child2Second").alias("child2Second"))).alias("child2"))
> // show the schema before and after swapping out the struct definition
> df.schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: 
> ARRAY<STRUCT<`child2First`: STRING, `child2Second`: BIGINT>>>
> df.withColumn("top", newTop).schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: 
> ARRAY<STRUCT<`child2Second`: ARRAY<BIGINT>>>>
> {code}
> Notice in this case that the new definition for {{top.child2.child2Second}} 
> is an {{ARRAY}}.  This is incorrect; it should simply be {{BIGINT}}.  
> There is nothing in the definition of the {{newTop}} {{struct}} that should 
> have caused the type to become wrapped in an array like this.
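
If the goal is just to drop {{child2First}}, another workaround sketch (assuming Spark 2.4's higher-order {{transform}} SQL function; {{df}} and the field names come from the snippet above, the name {{rebuiltTop}} is ours, and the result is untested here) rebuilds the struct per array element instead of reaching through the array:

{code:scala}
import org.apache.spark.sql.functions._

// Sketch: rebuild each element of top.child2 explicitly, so the remaining field
// keeps its original type instead of being wrapped in an extra array.
val rebuiltTop = struct(
  df("top").getField("child1").alias("child1"),
  expr("transform(top.child2, c -> named_struct('child2Second', c.child2Second))")
    .alias("child2"))

df.withColumn("top", rebuiltTop).schema.toDDL
// expected: ... `child2`: ARRAY<STRUCT<`child2Second`: BIGINT>> (no extra ARRAY level)
{code}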



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org