[jira] [Commented] (SPARK-28293) Implement Spark's own GetTableTypesOperation

2019-07-07 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880069#comment-16880069
 ] 

Yuming Wang commented on SPARK-28293:
-

I'm working on it.

> Implement Spark's own GetTableTypesOperation
> 
>
> Key: SPARK-28293
> URL: https://issues.apache.org/jira/browse/SPARK-28293
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: Hive-1.2.1.png, Hive-2.3.5.png
>
>
> Build with Hive 1.2.1:
> !Hive-1.2.1.png!
> Build with Hive 2.3.5:
> !Hive-2.3.5.png!
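
For reference, the table types in the screenshots are what a JDBC client sees through the Thrift Server's metadata call, which is what GetTableTypesOperation serves. A minimal sketch of such a client (the connection URL and credentials are placeholder assumptions, not from the issue):

{code:scala}
import java.sql.DriverManager

// List the table types reported by the Spark Thrift Server; this call is
// answered by GetTableTypesOperation on the server side.
// The JDBC URL and credentials below are placeholders.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "user", "")
try {
  val tableTypes = conn.getMetaData.getTableTypes
  while (tableTypes.next()) {
    println(tableTypes.getString("TABLE_TYPE"))
  }
} finally {
  conn.close()
}
{code}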






[jira] [Updated] (SPARK-28293) Implement Spark's own GetTableTypesOperation

2019-07-07 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28293:

Description: 
Build with Hive 1.2.1:

!Hive-1.2.1.png!

Build with Hive 2.3.5:

!Hive-2.3.5.png!

  was:
Build with Hive 1.2.1:

!image-2019-07-08-14-50-01-831.png!

Build with Hive 2.3.5:

!image-2019-07-08-14-52-48-963.png!


> Implement Spark's own GetTableTypesOperation
> 
>
> Key: SPARK-28293
> URL: https://issues.apache.org/jira/browse/SPARK-28293
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: Hive-1.2.1.png, Hive-2.3.5.png
>
>
> Build with Hive 1.2.1:
> !Hive-1.2.1.png!
> Build with Hive 2.3.5:
> !Hive-2.3.5.png!






[jira] [Updated] (SPARK-28293) Implement Spark's own GetTableTypesOperation

2019-07-07 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28293:

Attachment: Hive-1.2.1.png

> Implement Spark's own GetTableTypesOperation
> 
>
> Key: SPARK-28293
> URL: https://issues.apache.org/jira/browse/SPARK-28293
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: Hive-1.2.1.png, Hive-2.3.5.png
>
>
> Build with Hive 1.2.1:
> !image-2019-07-08-14-50-01-831.png!
> Build with Hive 2.3.5:
> !image-2019-07-08-14-52-48-963.png!






[jira] [Created] (SPARK-28294) Support `spark.history.fs.cleaner.maxNum` configuration

2019-07-07 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-28294:
-

 Summary: Support `spark.history.fs.cleaner.maxNum` configuration
 Key: SPARK-28294
 URL: https://issues.apache.org/jira/browse/SPARK-28294
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


Up to now, Apache Spark has maintained the event log directory with a time-based policy, `spark.history.fs.cleaner.maxAge`. However, there are two issues.

1. Some file systems limit the maximum number of files in a single directory. For example, HDFS's `dfs.namenode.fs-limits.max-directory-items` is 1024 * 1024 by default.
- 
https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

2. Spark is sometimes unable to clean up some old log files due to permission issues.

To handle both (1) and (2), this issue aims to support an additional number-based policy configuration for the event log directory, `spark.history.fs.cleaner.maxNum`. Spark can then try to keep the number of files in the event log directory within this limit.
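
For illustration, a minimal sketch of what a number-based cleaner policy could do, assuming the Hadoop FileSystem API and a plain event log directory; this is not the actual history server code path:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Keep at most `maxNum` files in the event log directory by deleting
// the oldest entries first (best effort).
def enforceMaxNum(logDir: String, maxNum: Int): Unit = {
  val fs = FileSystem.get(new Configuration())
  val statuses = fs.listStatus(new Path(logDir))
  if (statuses.length > maxNum) {
    statuses.sortBy(_.getModificationTime)      // oldest first
      .take(statuses.length - maxNum)
      .foreach(s => fs.delete(s.getPath, true))
  }
}
{code}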






[jira] [Updated] (SPARK-28293) Implement Spark's own GetTableTypesOperation

2019-07-07 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28293:

Attachment: Hive-2.3.5.png

> Implement Spark's own GetTableTypesOperation
> 
>
> Key: SPARK-28293
> URL: https://issues.apache.org/jira/browse/SPARK-28293
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: Hive-2.3.5.png
>
>
> Build with Hive 1.2.1:
> !image-2019-07-08-14-50-01-831.png!
> Build with Hive 2.3.5:
> !image-2019-07-08-14-52-48-963.png!






[jira] [Created] (SPARK-28293) Implement Spark's own GetTableTypesOperation

2019-07-07 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28293:
---

 Summary: Implement Spark's own GetTableTypesOperation
 Key: SPARK-28293
 URL: https://issues.apache.org/jira/browse/SPARK-28293
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


Build with Hive 1.2.1:

!image-2019-07-08-14-50-01-831.png!

Build with Hive 2.3.5:

!image-2019-07-08-14-52-48-963.png!






[jira] [Assigned] (SPARK-28292) Enable inject user-defined Hint

2019-07-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28292:


Assignee: (was: Apache Spark)

> Enable inject user-defined Hint
> ---
>
> Key: SPARK-28292
> URL: https://issues.apache.org/jira/browse/SPARK-28292
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: angerszhu
>Priority: Major
>
> We can't inject hints into the Analyzer; we hope to add an extension entry point for injecting user-defined hints.






[jira] [Assigned] (SPARK-28292) Enable inject user-defined Hint

2019-07-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28292:


Assignee: Apache Spark

> Enable inject user-defined Hint
> ---
>
> Key: SPARK-28292
> URL: https://issues.apache.org/jira/browse/SPARK-28292
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> We can't inject hints into the Analyzer; we hope to add an extension entry point for injecting user-defined hints.






[jira] [Created] (SPARK-28292) Enable inject user-defined Hint

2019-07-07 Thread angerszhu (JIRA)
angerszhu created SPARK-28292:
-

 Summary: Enable inject user-defined Hint
 Key: SPARK-28292
 URL: https://issues.apache.org/jira/browse/SPARK-28292
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0, 2.3.0
Reporter: angerszhu


We can't inject hints into the Analyzer; we hope to add an extension entry point for injecting user-defined hints.
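
For context, Spark already exposes {{SparkSessionExtensions}} for injecting analyzer and optimizer rules, just not hints. A minimal sketch of the existing injection style that such a hint entry point could mirror ({{MyHintRule}} is a placeholder, not a real rule):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Placeholder rule: a real implementation would rewrite hint nodes here.
case class MyHintRule(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// Today this injects a resolution rule; the request is for an analogous
// hook dedicated to user-defined hints.
val spark = SparkSession.builder()
  .withExtensions(e => e.injectResolutionRule(session => MyHintRule(session)))
  .getOrCreate()
{code}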






[jira] [Updated] (SPARK-24497) ANSI SQL: Recursive query

2019-07-07 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24497:

Summary: ANSI SQL: Recursive query  (was: Recursive query)

> ANSI SQL: Recursive query
> -
>
> Key: SPARK-24497
> URL: https://issues.apache.org/jira/browse/SPARK-24497
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> h3. *Examples*
> Here is an example of {{WITH RECURSIVE}} clause usage. The table "department" represents the structure of an organization as an adjacency list.
> {code:sql}
> CREATE TABLE department (
> id INTEGER PRIMARY KEY,  -- department ID
> parent_department INTEGER REFERENCES department, -- upper department ID
> name TEXT -- department name
> );
> INSERT INTO department (id, parent_department, "name")
> VALUES
>  (0, NULL, 'ROOT'),
>  (1, 0, 'A'),
>  (2, 1, 'B'),
>  (3, 2, 'C'),
>  (4, 2, 'D'),
>  (5, 0, 'E'),
>  (6, 4, 'F'),
>  (7, 5, 'G');
> -- department structure represented here is as follows:
> --
> -- ROOT-+->A-+->B-+->C
> --      |         |
> --      |         +->D-+->F
> --      +->E-+->G
> {code}
>  
>  To extract all departments under A, you can use the following recursive 
> query:
> {code:sql}
> WITH RECURSIVE subdepartment AS
> (
> -- non-recursive term
> SELECT * FROM department WHERE name = 'A'
> UNION ALL
> -- recursive term
> SELECT d.*
> FROM
> department AS d
> JOIN
> subdepartment AS sd
> ON (d.parent_department = sd.id)
> )
> SELECT *
> FROM subdepartment
> ORDER BY name;
> {code}
> More details:
> [http://wiki.postgresql.org/wiki/CTEReadme]
> [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html]
>  
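
Until {{WITH RECURSIVE}} is supported natively, a common workaround is to iterate the recursive term with DataFrame joins until no new rows appear. A minimal sketch against the {{department}} table above, assuming a driver-side loop (so only suitable for shallow hierarchies):

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}

def subdepartments(spark: SparkSession): DataFrame = {
  import spark.implicits._
  val department = spark.table("department")
  // Non-recursive term: the seed row(s).
  var frontier: DataFrame = department.filter($"name" === "A")
  var result = frontier
  while (frontier.count() > 0) {
    // Recursive term: children of the current frontier, minus the rows
    // already collected, so the loop reaches a fixpoint.
    frontier = department.as("d")
      .join(frontier.as("sd"), $"d.parent_department" === $"sd.id")
      .select($"d.*")
      .except(result)
    result = result.union(frontier)
  }
  result.orderBy($"name")
}
{code}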






[jira] [Commented] (SPARK-24497) ANSI SQL: Recursive query

2019-07-07 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880051#comment-16880051
 ] 

Yuming Wang commented on SPARK-24497:
-

Feature ID: T131

> ANSI SQL: Recursive query
> -
>
> Key: SPARK-24497
> URL: https://issues.apache.org/jira/browse/SPARK-24497
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> h3. *Examples*
> Here is an example of {{WITH RECURSIVE}} clause usage. The table "department" represents the structure of an organization as an adjacency list.
> {code:sql}
> CREATE TABLE department (
> id INTEGER PRIMARY KEY,  -- department ID
> parent_department INTEGER REFERENCES department, -- upper department ID
> name TEXT -- department name
> );
> INSERT INTO department (id, parent_department, "name")
> VALUES
>  (0, NULL, 'ROOT'),
>  (1, 0, 'A'),
>  (2, 1, 'B'),
>  (3, 2, 'C'),
>  (4, 2, 'D'),
>  (5, 0, 'E'),
>  (6, 4, 'F'),
>  (7, 5, 'G');
> -- department structure represented here is as follows:
> --
> -- ROOT-+->A-+->B-+->C
> --      |         |
> --      |         +->D-+->F
> --      +->E-+->G
> {code}
>  
>  To extract all departments under A, you can use the following recursive 
> query:
> {code:sql}
> WITH RECURSIVE subdepartment AS
> (
> -- non-recursive term
> SELECT * FROM department WHERE name = 'A'
> UNION ALL
> -- recursive term
> SELECT d.*
> FROM
> department AS d
> JOIN
> subdepartment AS sd
> ON (d.parent_department = sd.id)
> )
> SELECT *
> FROM subdepartment
> ORDER BY name;
> {code}
> More details:
> [http://wiki.postgresql.org/wiki/CTEReadme]
> [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html]
>  






[jira] [Updated] (SPARK-24497) Support recursive SQL query

2019-07-07 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24497:

Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-27764

> Support recursive SQL query
> ---
>
> Key: SPARK-24497
> URL: https://issues.apache.org/jira/browse/SPARK-24497
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> h3. *Examples*
> Here is an example of {{WITH RECURSIVE}} clause usage. The table "department" represents the structure of an organization as an adjacency list.
> {code:sql}
> CREATE TABLE department (
> id INTEGER PRIMARY KEY,  -- department ID
> parent_department INTEGER REFERENCES department, -- upper department ID
> name TEXT -- department name
> );
> INSERT INTO department (id, parent_department, "name")
> VALUES
>  (0, NULL, 'ROOT'),
>  (1, 0, 'A'),
>  (2, 1, 'B'),
>  (3, 2, 'C'),
>  (4, 2, 'D'),
>  (5, 0, 'E'),
>  (6, 4, 'F'),
>  (7, 5, 'G');
> -- department structure represented here is as follows:
> --
> -- ROOT-+->A-+->B-+->C
> --      |         |
> --      |         +->D-+->F
> --      +->E-+->G
> {code}
>  
>  To extract all departments under A, you can use the following recursive 
> query:
> {code:sql}
> WITH RECURSIVE subdepartment AS
> (
> -- non-recursive term
> SELECT * FROM department WHERE name = 'A'
> UNION ALL
> -- recursive term
> SELECT d.*
> FROM
> department AS d
> JOIN
> subdepartment AS sd
> ON (d.parent_department = sd.id)
> )
> SELECT *
> FROM subdepartment
> ORDER BY name;
> {code}
> More details:
> [http://wiki.postgresql.org/wiki/CTEReadme]
> [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html]
>  






[jira] [Updated] (SPARK-24497) Recursive query

2019-07-07 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24497:

Summary: Recursive query  (was: Support recursive SQL query)

> Recursive query
> ---
>
> Key: SPARK-24497
> URL: https://issues.apache.org/jira/browse/SPARK-24497
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> h3. *Examples*
> Here is an example of {{WITH RECURSIVE}} clause usage. The table "department" represents the structure of an organization as an adjacency list.
> {code:sql}
> CREATE TABLE department (
> id INTEGER PRIMARY KEY,  -- department ID
> parent_department INTEGER REFERENCES department, -- upper department ID
> name TEXT -- department name
> );
> INSERT INTO department (id, parent_department, "name")
> VALUES
>  (0, NULL, 'ROOT'),
>  (1, 0, 'A'),
>  (2, 1, 'B'),
>  (3, 2, 'C'),
>  (4, 2, 'D'),
>  (5, 0, 'E'),
>  (6, 4, 'F'),
>  (7, 5, 'G');
> -- department structure represented here is as follows:
> --
> -- ROOT-+->A-+->B-+->C
> --      |         |
> --      |         +->D-+->F
> --      +->E-+->G
> {code}
>  
>  To extract all departments under A, you can use the following recursive 
> query:
> {code:sql}
> WITH RECURSIVE subdepartment AS
> (
> -- non-recursive term
> SELECT * FROM department WHERE name = 'A'
> UNION ALL
> -- recursive term
> SELECT d.*
> FROM
> department AS d
> JOIN
> subdepartment AS sd
> ON (d.parent_department = sd.id)
> )
> SELECT *
> FROM subdepartment
> ORDER BY name;
> {code}
> More details:
> [http://wiki.postgresql.org/wiki/CTEReadme]
> [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html]
>  






[jira] [Commented] (SPARK-27951) ANSI SQL: NTH_VALUE function

2019-07-07 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880032#comment-16880032
 ] 

Yuming Wang commented on SPARK-27951:
-

Feature ID: T618

> ANSI SQL: NTH_VALUE function
> 
>
> Key: SPARK-27951
> URL: https://issues.apache.org/jira/browse/SPARK-27951
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> |{{nth_value(value any, nth integer)}}|same type as {{value}}|returns {{value}} evaluated at the row that is the {{nth}} row of the window frame (counting from 1); null if no such row|
> [https://www.postgresql.org/docs/8.4/functions-window.html]
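
Until a built-in {{nth_value}} exists, its semantics can be approximated with existing Spark functions. A minimal sketch on toy data; the column names {{grp}}, {{ord}}, {{value}} and the frame choice are placeholder assumptions:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list, element_at}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Toy data: (grp, ord, value).
val df = Seq(("a", 1, 10.0), ("a", 2, 20.0), ("a", 3, 30.0)).toDF("grp", "ord", "value")

// Approximate nth_value(value, 2) over a running frame: collect the frame
// into an array and index it 1-based; element_at returns null when the
// frame holds fewer than 2 rows, matching the "no such row" case.
val w = Window.partitionBy("grp").orderBy("ord")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("nth_value", element_at(collect_list(col("value")).over(w), 2)).show()
{code}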






[jira] [Updated] (SPARK-27951) ANSI SQL: NTH_VALUE function

2019-07-07 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27951:

Summary: ANSI SQL: NTH_VALUE function  (was: Built-in function: NTH_VALUE)

> ANSI SQL: NTH_VALUE function
> 
>
> Key: SPARK-27951
> URL: https://issues.apache.org/jira/browse/SPARK-27951
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> |{{nth_value(value any, nth integer)}}|same type as {{value}}|returns {{value}} evaluated at the row that is the {{nth}} row of the window frame (counting from 1); null if no such row|
> [https://www.postgresql.org/docs/8.4/functions-window.html]






[jira] [Updated] (SPARK-28291) UDFs cannot be evaluated within inline table definition

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28291:
-
Description: 
{code}
spark.udf.register("udf", (input: Double) => input)
sql("SELECT * FROM (VALUES (CAST(udf('1') AS DOUBLE)), (CAST(udf('Infinity') AS 
DOUBLE))) v(x)")
{code}

{code}
org.apache.spark.sql.AnalysisException: cannot evaluate expression 
CAST(UDF:udf(1) AS DOUBLE) in inline table definition; line 1 pos 23
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2(ResolveInlineTables.scala:68)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2$adapted(ResolveInlineTables.scala:65)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1(ResolveInlineTables.scala:65)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1$adapted(ResolveInlineTables.scala:64)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
{code}
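
One possible workaround, as a sketch (not from the issue itself): keep the VALUES list foldable, with casts of literals only, and apply the UDF in the outer projection instead:

{code:scala}
spark.udf.register("udf", (input: Double) => input)
// The inline table now contains only foldable casts of literals, so
// ResolveInlineTables can evaluate it; the UDF runs in the projection.
sql("SELECT udf(x) FROM (VALUES (CAST('1' AS DOUBLE)), (CAST('Infinity' AS DOUBLE))) v(x)")
{code}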


  was:
{code}
spark.udf.register("udf", (input: Double) => input)
sql("SELECT * FROM (VALUES (CAST(udf('1') AS DOUBLE)), (CAST(udf('Infinity') AS 
DOUBLE))) v(x)")
{code}

{code}
org.apache.spark.sql.AnalysisException: cannot evaluate expression 
CAST(UDF:udf(1) AS DOUBLE) in inline table definition; line 1 pos 72
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2(ResolveInlineTables.scala:68)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2$adapted(ResolveInlineTables.scala:65)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1(ResolveInlineTables.scala:65)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1$adapted(ResolveInlineTables.scala:64)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.validateInputEvaluable(ResolveInlineTables.scala:64)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:35)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:32)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
{code}



> UDFs cannot be evaluated within inline table definition
> ---
>
> Key: SPARK-28291
> URL: https://issues.apache.org/jira/browse/SPARK-28291
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> {code}
> spark.udf.register("udf", (input: Double) => input)
> sql("SELECT * FROM (VALUES (CAST(udf('1') AS DOUBLE)), (CAST(udf('Infinity') 
> AS DOUBLE))) v(x)")
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: cannot evaluate expression 
> CAST(UDF:udf(1) AS DOUBLE) in inline table definition; line 1 pos 23
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2(ResolveInlineTables.scala:68)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2$adapted(ResolveInlineTables.scala:65)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1(ResolveInlineTables.scala:65)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1$adapted(ResolveInlineTables.scala:64)
>   at scala.collection.mu

[jira] [Updated] (SPARK-28291) UDFs cannot be evaluated within inline table definition

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28291:
-
Description: 
{code}
spark.udf.register("udf", (input: Double) => input)
sql("SELECT * FROM (VALUES (CAST(udf('1') AS DOUBLE)), (CAST(udf('Infinity') AS 
DOUBLE))) v(x)")
{code}

{code}
org.apache.spark.sql.AnalysisException: cannot evaluate expression 
CAST(UDF:udf(1) AS DOUBLE) in inline table definition; line 1 pos 72
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2(ResolveInlineTables.scala:68)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2$adapted(ResolveInlineTables.scala:65)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1(ResolveInlineTables.scala:65)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1$adapted(ResolveInlineTables.scala:64)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.validateInputEvaluable(ResolveInlineTables.scala:64)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:35)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:32)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
{code}


  was:
{code}
spark.udf.register("udf", (input: Double) => input)
sql("SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) FROM (VALUES 
(CAST(udf('1') AS DOUBLE)), (CAST(udf('Infinity') AS DOUBLE))) v(x)")
{code}

{code}
org.apache.spark.sql.AnalysisException: cannot evaluate expression 
CAST(UDF:udf(1) AS DOUBLE) in inline table definition; line 1 pos 72
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2(ResolveInlineTables.scala:68)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2$adapted(ResolveInlineTables.scala:65)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1(ResolveInlineTables.scala:65)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1$adapted(ResolveInlineTables.scala:64)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.validateInputEvaluable(ResolveInlineTables.scala:64)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:35)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:32)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
{code}



> UDFs cannot be evaluated within inline table definition
> ---
>
> Key: SPARK-28291
> URL: https://issues.apache.org/jira/browse/SPARK-28291
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> {code}
> spark.udf.register("udf", (input: Double) => input)
> sql("SELECT * FROM (VALUES (CAST(udf('1') AS DOUBLE)), (CAST(udf('Infinity') 
> AS DOUBLE))) v(x)")
> {code}
> {code}
> org.ap

[jira] [Updated] (SPARK-28291) UDFs cannot be evaluated within inline table definition

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28291:
-
Description: 
{code}
spark.udf.register("udf", (input: Double) => input)
sql("SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) FROM (VALUES 
(CAST(udf('1') AS DOUBLE)), (CAST(udf('Infinity') AS DOUBLE))) v(x)")
{code}

{code}
org.apache.spark.sql.AnalysisException: cannot evaluate expression 
CAST(UDF:udf(1) AS DOUBLE) in inline table definition; line 1 pos 72
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2(ResolveInlineTables.scala:68)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2$adapted(ResolveInlineTables.scala:65)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1(ResolveInlineTables.scala:65)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1$adapted(ResolveInlineTables.scala:64)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.validateInputEvaluable(ResolveInlineTables.scala:64)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:35)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:32)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
{code}


  was:
{code}
spark.udf.register("udf", (input: Double) => input
sql("SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) FROM (VALUES 
(CAST(udf('1') AS DOUBLE)), (CAST(udf('Infinity') AS DOUBLE))) v(x)")
{code}

{code}
org.apache.spark.sql.AnalysisException: cannot evaluate expression 
CAST(UDF:udf(1) AS DOUBLE) in inline table definition; line 1 pos 72
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2(ResolveInlineTables.scala:68)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2$adapted(ResolveInlineTables.scala:65)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1(ResolveInlineTables.scala:65)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1$adapted(ResolveInlineTables.scala:64)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.validateInputEvaluable(ResolveInlineTables.scala:64)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:35)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:32)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
{code}



> UDFs cannot be evaluated within inline table definition
> ---
>
> Key: SPARK-28291
> URL: https://issues.apache.org/jira/browse/SPARK-28291
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> {code}
> spark.udf.register("udf", (input: Double) => input)
> sql("SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) FROM (VALUES 
>

[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27921:
-
Description: 
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
 Namely, most plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
  b'mpy_type": "int64", "metadata": null}], "creator": {"library'
  b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])
{code}
 
 1. Copy and paste {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}} 
file into {{sql/core/src/test/resources/sql-tests/inputs/udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from 
{{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}}, for now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement, for example turning {{SELECT a FROM t}} into {{SELECT udf(a) FROM t}}. It is not required to add more combinations, and the exact placement is not strict.

5. Run it below again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
# or git diff --no-index 
sql/core/src/test/resources/sql-tests/results/xxx.sql.out 
sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out
{code}
6. Compare results with original file, 
{{sql/core/src/test/resources/sql-tests/results/xxx.sql.out}}

7. If there is a diff, analyze it, file or find the corresponding JIRA, and skip the affected tests with comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}
9. When you open a PR, please attach {{git diff --no-index sql/core/src/test/resources/sql-tests/results/xxx.sql.out sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out}} in the PR description with the template below:
{code:java}
Diff comparing to 'xxx.sql'


```diff
...  # here you put 'git diff' results
```



{code}
10. You're ready. Please go for a PR! See 
https://github.com/apache/spark/pull/25069 as an example.

Note that the registered UDFs all return strings, so some differences are expected.
Note that this JIRA targets plan-specific cases in general.
Note that one {{output.sql.out}} file is shared by three UDF test cases (Scala UDF, Python UDF, and Pandas UDF). Be aware of this when you fix the tests.

  was:
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
 Namely, most plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "a", "field_n

[jira] [Commented] (SPARK-28273) Convert and port 'pgSQL/case.sql' into UDF test base

2019-07-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880024#comment-16880024
 ] 

Apache Spark commented on SPARK-28273:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/25070

> Convert and port 'pgSQL/case.sql' into UDF test base
> 
>
> Key: SPARK-28273
> URL: https://issues.apache.org/jira/browse/SPARK-28273
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See SPARK-27934






[jira] [Assigned] (SPARK-28273) Convert and port 'pgSQL/case.sql' into UDF test base

2019-07-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28273:


Assignee: Apache Spark

> Convert and port 'pgSQL/case.sql' into UDF test base
> 
>
> Key: SPARK-28273
> URL: https://issues.apache.org/jira/browse/SPARK-28273
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> See SPARK-27934






[jira] [Assigned] (SPARK-28273) Convert and port 'pgSQL/case.sql' into UDF test base

2019-07-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28273:


Assignee: (was: Apache Spark)

> Convert and port 'pgSQL/case.sql' into UDF test base
> 
>
> Key: SPARK-28273
> URL: https://issues.apache.org/jira/browse/SPARK-28273
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See SPARK-27934






[jira] [Commented] (SPARK-28273) Convert and port 'pgSQL/case.sql' into UDF test base

2019-07-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880023#comment-16880023
 ] 

Apache Spark commented on SPARK-28273:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/25070

> Convert and port 'pgSQL/case.sql' into UDF test base
> 
>
> Key: SPARK-28273
> URL: https://issues.apache.org/jira/browse/SPARK-28273
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See SPARK-27934






[jira] [Assigned] (SPARK-28270) Convert and port 'pgSQL/aggregates_part1.sql' into UDF test base

2019-07-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28270:


Assignee: Apache Spark

> Convert and port 'pgSQL/aggregates_part1.sql' into UDF test base
> 
>
> Key: SPARK-28270
> URL: https://issues.apache.org/jira/browse/SPARK-28270
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> see SPARK-27770






[jira] [Assigned] (SPARK-28270) Convert and port 'pgSQL/aggregates_part1.sql' into UDF test base

2019-07-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28270:


Assignee: (was: Apache Spark)

> Convert and port 'pgSQL/aggregates_part1.sql' into UDF test base
> 
>
> Key: SPARK-28270
> URL: https://issues.apache.org/jira/browse/SPARK-28270
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> see SPARK-27770






[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27921:
-
Description: 
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
 Namely, most plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
  b'mpy_type": "int64", "metadata": null}], "creator": {"library'
  b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])
{code}
 
 1. Copy and paste {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}} 
file into {{sql/core/src/test/resources/sql-tests/inputs/udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from 
{{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}}, for now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement. It is not required to add more combinations, and the exact placement is not strict.

5. Run it below again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
# or git diff --no-index 
sql/core/src/test/resources/sql-tests/results/xxx.sql.out 
sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out
{code}
6. Compare results with original file, 
{{sql/core/src/test/resources/sql-tests/results/xxx.sql.out}}

7. If there is a diff, file or find the corresponding JIRA, and skip the affected tests with comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}
9. When you open a PR, please attach {{git diff --no-index sql/core/src/test/resources/sql-tests/results/xxx.sql.out sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out}} in the PR description with the template below:
{code:java}
Diff comparing to 'xxx.sql'


```diff
...  # here you put 'git diff' results
```



{code}
10. You're ready. Please go for a PR! See 
https://github.com/apache/spark/pull/25069 as an example.

Note that the registered UDFs all return strings, so some differences are expected.
Note that this JIRA targets plan-specific cases in general.
Note that one {{output.sql.out}} file is shared by three UDF test cases (Scala UDF, Python UDF, and Pandas UDF). Be aware of this when you fix the tests.

  was:
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
 Namely, most plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "a", "field_name": "a", "p

[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27921:
-
Description: 
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
 Namely, most plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
  b'mpy_type": "int64", "metadata": null}], "creator": {"library'
  b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])
{code}
 
 1. Copy and paste {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}} 
file into {{sql/core/src/test/resources/sql-tests/inputs/udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from 
{{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}}, for now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement. It is not required to add more combinations, and the exact placement is not strict.

5. Run it below again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
# or git diff --no-index 
sql/core/src/test/resources/sql-tests/results/xxx.sql.out 
sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out
{code}
6. Compare results with original file, 
{{sql/core/src/test/resources/sql-tests/results/xxx.sql.out}}

7. If there is a diff, file or find the corresponding JIRA, and skip the affected tests with comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}
9. When you open a PR, please attach {{git diff --no-index sql/core/src/test/resources/sql-tests/results/xxx.sql.out sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out}} in the PR description with the template below:
{code:java}
Diff comparing to 'xxx.sql'


```diff
...  # here you put 'git diff' results
```



{code}
10. You're ready. Please go for a PR!

Note that the registered UDFs all return strings, so some differences are expected.
Note that this JIRA targets plan-specific cases in general.
Note that one {{output.sql.out}} file is shared by three UDF test cases (Scala UDF, Python UDF, and Pandas UDF). Be aware of this when you fix the tests.

  was:
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
 Namely, most plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
  b'mpy_type": "int64", 

[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27921:
-
Description: 
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
 Namely, most plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
  b'mpy_type": "int64", "metadata": null}], "creator": {"library'
  b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])
{code}
 
 1. Copy and paste {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}} 
file into {{sql/core/src/test/resources/sql-tests/inputs/udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from 
{{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}}, for now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement. It is not required to add more combinations, and the exact placement is not strict.

5. Run it below again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
# or diff sql/core/src/test/resources/sql-tests/results/xxx.sql.out 
sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out
{code}
6. Compare results with original file, 
{{sql/core/src/test/resources/sql-tests/results/xxx.sql.out}}

7. If there is a diff, file or find the corresponding JIRA, and skip the affected tests with comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}
9. When you open a PR, please attach {{diff sql/core/src/test/resources/sql-tests/results/xxx.sql.out sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out}} in the PR description with the template below:
{code:java}
Diff comparing to 'xxx.sql'


```diff
...  # here you put 'git diff' results
```



{code}
10. You're ready. Please go for a PR!

Note that the registered UDFs all return strings, so some differences are expected.
Note that this JIRA targets plan-specific cases in general.
Note that one {{output.sql.out}} file is shared by three UDF test cases (Scala UDF, Python UDF, and Pandas UDF). Be aware of this when you fix the tests.

  was:
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
 Namely, most plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
  b'mpy_type": "int64", "metadata": null}], "creator": 

[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27921:
-
Description: 
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions and issues, such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
 Namely, most plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
  b'mpy_type": "int64", "metadata": null}], "creator": {"library'
  b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])
{code}
 
 1. Copy and paste {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}} 
file into {{sql/core/src/test/resources/sql-tests/inputs/udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from 
{{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}}, for now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement. It is not required to add more 
combinations, and it is not strict about where to insert.

5. Run it below again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
# or diff sql/core/src/test/resources/sql-tests/results/xxx.sql.out 
sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out
{code}
6. Compare the results with the original file, 
{{sql/core/src/test/resources/sql-tests/results/xxx.sql.out}}.

7. If there are diffs, file a new JIRA or find an existing one, and skip the 
affected tests with comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}
9. When you open a PR, please attach {{diff 
sql/core/src/test/resources/sql-tests/results/xxx.sql.out 
sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out}} in the PR 
description with the template below:
{code:java}
Diff comparing to 'xxx.sql'


```diff
...  # here you put 'git diff' results
```


{code}
10. You're ready. Please go for a PR!

Note that registered UDFs all return strings, so some differences are expected.
Note that this JIRA targets plan-specific cases in general.
Note that one {{output.sql.out}} file is shared across three UDF test cases 
(Scala UDF, Python UDF, and Pandas UDF). Keep this in mind when you fix the 
tests.

  was:
This JIRA targets to improve Python test coverage in particular about 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert *.sql test cases that can be affected by this rule 
{{ExtractPythonUDFs}} like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql]
 Namely most of plan related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
  b'mpy_type": "int64", "metadata": null}], "creator": {

[jira] [Created] (SPARK-28291) UDFs cannot be evaluated within inline table definition

2019-07-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28291:


 Summary: UDFs cannot be evaluated within inline table definition
 Key: SPARK-28291
 URL: https://issues.apache.org/jira/browse/SPARK-28291
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


{code}
spark.udf.register("udf", (input: Double) => input)
sql("SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) FROM (VALUES 
(CAST(udf('1') AS DOUBLE)), (CAST(udf('Infinity') AS DOUBLE))) v(x)")
{code}

{code}
org.apache.spark.sql.AnalysisException: cannot evaluate expression 
CAST(UDF:udf(1) AS DOUBLE) in inline table definition; line 1 pos 72
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2(ResolveInlineTables.scala:68)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2$adapted(ResolveInlineTables.scala:65)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1(ResolveInlineTables.scala:65)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1$adapted(ResolveInlineTables.scala:64)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.validateInputEvaluable(ResolveInlineTables.scala:64)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:35)
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:32)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
{code}
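
For what it's worth, a sketch of a rewrite that avoids the error by keeping 
only foldable literals inside the {{VALUES}} clause and applying the UDF in 
the select list instead (a workaround, not a fix for the underlying 
limitation):
{code}
SELECT avg(CAST(udf(x) AS DOUBLE)), var_pop(CAST(udf(x) AS DOUBLE))
FROM (VALUES ('1'), ('Infinity')) v(x)
{code}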




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27921:
-
Description: 
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions and issues, such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
 Namely, most plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
  b'mpy_type": "int64", "metadata": null}], "creator": {"library'
  b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])
{code}
 
 1. Copy and paste {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}} 
file into {{sql/core/src/test/resources/sql-tests/inputs/udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from 
{{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}}, for now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement. It is not required to add more 
combinations, and it is not strict about where to insert.

5. Run it below again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
# or diff sql/core/src/test/resources/sql-tests/results/xxx.sql.out 
sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out
{code}
6. Compare the results with the original file, 
{{sql/core/src/test/resources/sql-tests/results/xxx.sql.out}}.

7. If there are diffs, file a new JIRA or find an existing one, and skip the 
affected tests with comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}
9. When you open a PR, please attach {{diff 
sql/core/src/test/resources/sql-tests/results/xxx.sql.out 
sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out}} in the PR 
description with the template below:
{code:java}
Diff comparing to 'xxx.sql'


```diff
...  # here you put 'git diff' results
```


{code}
10. You're ready. Please go for a PR!

Note that registered UDFs all return strings, so some differences are expected.
Note that this JIRA targets plan-specific cases in general.

  was:
This JIRA targets to improve Python test coverage in particular about 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert *.sql test cases that can be affected by this rule 
{{ExtractPythonUDFs}} like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql]
 Namely most of plan related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
  b'mpy_type": "int64", "metadata": null}], "creator": {"library'
  b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])
{code}
 
 1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx

[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27921:
-
Description: 
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions and issues, such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
 Namely, most plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
  b'mpy_type": "int64", "metadata": null}], "creator": {"library'
  b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])
{code}
 
 1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement. It is not required to add more 
combinations, and it is not strict about where to insert.

5. Run it below again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
# or diff xxx.sql.out udf/xxx.sql.out
{code}
6. Compare the results with the original file, {{xxx.sql}}.

7. If there are diffs, file a new JIRA or find an existing one, and skip the 
affected tests with comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}
9. When you open a PR, please attach {{diff xxx.sql.out udf/xxx.sql.out}} in 
the PR description with the template below:
{code:java}
Diff comparing to 'xxx.sql'


```diff
...  # here you put 'git diff' results
```


{code}

10. You're ready. Please go for a PR!

Note that registered UDFs all return strings, so some differences are expected.
Note that this JIRA targets plan-specific cases in general.

  was:
This JIRA targets to improve Python test coverage in particular about 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert *.sql test cases that can be affected by this rule 
{{ExtractPythonUDFs}} like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql]
 Namely most of plan related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
  b'mpy_type": "int64", "metadata": null}], "creator": {"library'
  b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])
{code}
 
 1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement. It is not required to add more 
combinations.
 And it is not strict about where to insert.

5

[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27921:
-
Description: 
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions and issues, such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
 Namely, most plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
  b'mpy_type": "int64", "metadata": null}], "creator": {"library'
  b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])
{code}
 
 1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement. It is not required to add more 
combinations, and it is not strict about where to insert.

5. Run it below again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
# or diff xxx.sql.out udf/xxx.sql.out
{code}
6. Compare the results with the original file, {{xxx.sql}}. If there are no 
notable diffs, open a PR.

7. If there are diffs, file a new JIRA or find an existing one, and skip the 
affected tests with comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}
9. When you open a PR, please attach {{diff xxx.sql.out udf/xxx.sql.out}} in 
the PR description with the template below:
{code:java}
Diff comparing to 'xxx.sql'


```diff
...  # here you put 'git diff' results
```


{code}
Note that registered UDFs all return strings, so some differences are expected.
Note that this JIRA targets plan-specific cases in general.

  was:
This JIRA targets to improve Python test coverage in particular about 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert *.sql test cases that can be affected by this rule 
{{ExtractPythonUDFs}} like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql]
 Namely most of plan related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
  b'mpy_type": "int64", "metadata": null}], "creator": {"library'
  b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])
{code}
 
 1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement. It is not required to add more 
combinations.
 And it is not strict about where to insert.

[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27921:
-
Description: 
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions and issues, such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
 Namely, most plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
  b'mpy_type": "int64", "metadata": null}], "creator": {"library'
  b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])
{code}
 
 1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement. It is not required to add more 
combinations, and it is not strict about where to insert.

5. Run it below again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
# or diff xxx.sql.out udf/xxx.sql.out
{code}
6. Compare the results with the original file, {{xxx.sql}}. If there are no 
notable diffs, open a PR.

7. If there are diffs, file a new JIRA or find an existing one, and skip the 
affected tests with comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}
9. When you open a PR, please attach {{diff xxx.sql.out udf/xxx.sql.out}} in 
the PR description with the template below:
{code:java}
Diff comparing to 'xxx.sql'


```diff
...  # here you put 'git diff' results
```


{code}
Note that registered UDFs all return strings, so some differences are expected.
Note that this JIRA targets plan-specific cases in general.

  was:
This JIRA targets to improve Python test coverage in particular about 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert *.sql test cases that can be affected by this rule 
{{ExtractPythonUDFs}} like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql]
 Namely most of plan related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
  b'mpy_type": "int64", "metadata": null}], "creator": {"library'
  b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])
{code}
 
 1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert `udf(...)` into each statement. It is not required to add more 
combinations.
 And it is not strict about where to insert.


[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27921:
-
Description: 
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions and issues, such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
 Namely, most plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
  b'mpy_type": "int64", "metadata": null}], "creator": {"library'
  b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])
{code}
 
 1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert `udf(...)` into each statement. It is not required to add more 
combinations, and it is not strict about where to insert.

5. Run it below again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
# or diff xxx.sql.out udf/xxx.sql.out
{code}
6. Compare the results with the original file, {{xxx.sql}}. If there are no 
notable diffs, open a PR.

7. If there are diffs, file a new JIRA or find an existing one, and skip the 
affected tests with comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}
9. When you open a PR, please attach {{diff xxx.sql.out udf/xxx.sql.out}} in 
the PR description with the template below:
{code:java}
Diff comparing to 'xxx.sql'


```diff
...  # here you put 'git diff' results
```


{code}
Note that registered UDFs all return strings, so some differences are expected.
Note that this JIRA targets plan-specific cases in general.

  was:
This JIRA targets to improve Python test coverage in particular about 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert *.sql test cases that can be affected by this rule 
{{ExtractPythonUDFs}} like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql]
 Namely most of plan related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:

{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
  b'mpy_type": "int64", "metadata": null}], "creator": {"library'
  b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])

{code}
 
1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert `udf(...)` into each statement. It is not required to add more 
combinations.
 And it is not strict about where to insert.

5

[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27921:
-
Description: 
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions and issues, such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
 Namely, most plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:

{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata

OrderedDict([(b'pandas',
  b'{"index_columns": [{"kind": "range", "name": null, "start": '
  b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
  b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
  b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
  b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
  b'mpy_type": "int64", "metadata": null}], "creator": {"library'
  b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])

{code}
 
1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert `udf(...)` into each statement. It is not required to add more 
combinations, and it is not strict about where to insert.

5. Run it below again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
# or diff xxx.sql.out udf/xxx.sql.out
{code}
6. Compare the results with the original file, {{xxx.sql}}. If there are no 
notable diffs, open a PR.

7. If there are diffs, file a new JIRA or find an existing one, and skip the 
affected tests with comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}
9. When you open a PR, please attach {{diff xxx.sql.out udf/xxx.sql.out}} in 
the PR description with the template below:
{code:java}
Diff comparing to 'xxx.sql'


```diff
...  # here you put 'git diff' results
```


{code}
Note that registered UDFs all return strings, so some differences are expected.
Note that this JIRA targets plan-specific cases in general.

  was:
This JIRA targets to improve Python test coverage in particular about 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert *.sql test cases that can be affected by this rule 
{{ExtractPythonUDFs}} like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql]
 Namely most of plan related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}

4. Insert `udf(...)` into each statement. It is not required to add more 
combinations.
 And it is not strict about where to insert.

5. Run it below again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
# or diff xxx.sql.out udf/xxx.sql.out
{code}
6. Compare the results with the original file, {{xxx.sql}}. If there are no 
notable diffs, open a PR.

7. If there are diffs, file a new JIRA or find an existing one, and skip the 
affected tests with comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}

9. When you open a PR, please attach {{diff xxx.sql.out udf/xxx.sql.out}} in 
the PR description with the template below:

{code}
Diff comparing to 'xxx.sql'


```diff
...  # here you put 'git diff' results
```


{code}

Note that registered UDFs all return strings, so some differences are expected.
Note that this JIRA targets plan-specific cases in general.


> Convert applicable *.sql tests into UDF integrated test 

[jira] [Updated] (SPARK-28231) Adaptive execution should ignore RepartitionByExpression

2019-07-07 Thread Jrex Ge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jrex Ge updated SPARK-28231:

Issue Type: Wish  (was: Bug)

> Adaptive execution should ignore RepartitionByExpression
> 
>
> Key: SPARK-28231
> URL: https://issues.apache.org/jira/browse/SPARK-28231
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Jrex Ge
>Priority: Major
>
> When adaptive execution is enabled, it may modify the partitioning requested 
> by {{Dataset.repartition}} by expressions ({{RepartitionByExpression}}); 
> adaptive execution should ignore such user-specified repartitioning



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27921:
-
Description: 
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions and issues, such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
 Namely, most plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}

4. Insert `udf(...)` into each statement. It is not required to add more 
combinations, and it is not strict about where to insert.

5. Run it below again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
# or diff xxx.sql.out udf/xxx.sql.out
{code}
6. Compare the results with the original file, {{xxx.sql}}. If there are no 
notable diffs, open a PR.

7. If there are diffs, file a new JIRA or find an existing one, and skip the 
affected tests with comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}

9. When you open a PR, please attach {{diff xxx.sql.out udf/xxx.sql.out}} in 
the PR description with the template below:

{code}
Diff comparing to 'xxx.sql'


```diff
...  # here you put 'git diff' results
```


{code}

Note that registered UDFs all return strings, so some differences are expected.
Note that this JIRA targets plan-specific cases in general.

  was:
This JIRA targets to improve Python test coverage in particular about 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert *.sql test cases that can be affected by this rule 
{{ExtractPythonUDFs}} like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql]
 Namely most of plan related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert `udf(...)` into each statement. It is not required to add more 
combinations.
 And it is not strict about where to insert.

5. Run it below again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
{code}
6. Compare the results with the original file, {{xxx.sql}}. If there are no 
notable diffs, open a PR.

7. If there are diffs, file a new JIRA or find an existing one, and skip the 
affected tests with comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}

9. _If possible_ (not required), when you open a PR, please attach the {{git 
diff xxx.sql.out}} output between steps 3 and 5 in the PR description with the 
template below:

{code}
Diff comparing to 'xxx.sql'


```diff
...  # here you put 'git diff' results
```


{code}

Note that registered UDFs all return strings, so some differences are expected.
Note that this JIRA targets plan-specific cases in general.


> Convert applicable *.sql tests into UDF integrated test base
> 
>
> Key: SPARK-27921
> URL: https://issues.apache.org/jira/browse/SPARK-27921
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA targets to improve Python test coverage in particular about 
> {{ExtractPythonUDFs}}.
>  This rule has caused many regressions or issues such as SPARK-27803, 
> SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
>  We should convert *.sql test cases that can be affected by this rule 
> {{ExtractPythonUDFs}} like 
> [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql]
>  Namely most of plan related test cases might have to be converted.

[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27921:
-
Description: 
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions and issues, such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
 Namely, most plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert `udf(...)` into each statement. It is not required to add more 
combinations, and it is not strict about where to insert.

5. Run it below again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
{code}
6. Compare the results with the original file, {{xxx.sql}}. If there are no 
notable diffs, open a PR.

7. If there are diffs, file a new JIRA or find an existing one, and skip the 
affected tests with comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}

9. _If possible_ (not required), when you open a PR, please attach the {{git 
diff xxx.sql.out}} output between steps 3 and 5 in the PR description with the 
template below:

{code}
Diff comparing to 'xxx.sql'


```diff
...  # here you put 'git diff' results
```


{code}

Note that registered UDFs all return strings, so some differences are expected.
Note that this JIRA targets plan-specific cases in general.

  was:
This JIRA targets to improve Python test coverage in particular about 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert *.sql test cases that can be affected by this rule 
{{ExtractPythonUDFs}} like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql]
 Namely most of plan related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert `udf(...)` into each statement. It is not required to add more 
combinations.
 And it is not strict about where to insert.

5. Run it below again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
{code}
6. Compare the results with the original file, {{xxx.sql}}. If there are no 
notable diffs, open a PR.

7. If there are diffs, file a new JIRA or find an existing one, and skip the 
affected tests with comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}

9. When you open a PR, please attach the {{git diff xxx.sql.out}} output 
between steps 3 and 5 in the PR description with the template below:

{code}
Diff comparing to 'xxx.sql'


```diff
...  # here you put 'git diff' results
```


{code}

Note that registered UDFs all return strings, so some differences are expected.
Note that this JIRA targets plan-specific cases in general.


> Convert applicable *.sql tests into UDF integrated test base
> 
>
> Key: SPARK-27921
> URL: https://issues.apache.org/jira/browse/SPARK-27921
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA targets to improve Python test coverage in particular about 
> {{ExtractPythonUDFs}}.
>  This rule has caused many regressions or issues such as SPARK-27803, 
> SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
>  We should convert *.sql test cases that can be affected by this rule 
> {{ExtractPythonUDFs}} like 
> [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql]
>  Namely most of plan related test cases might have to be converted.
> *Here is the rough contributio

[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27921:
-
Description: 
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions and issues, such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
 Namely, most plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert `udf(...)` into each statement. It is not required to add more 
combinations, and it is not strict about where to insert.

5. Run it below again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
{code}
6. Compare the results with the original file, {{xxx.sql}}. If there are no 
notable diffs, open a PR.

7. If there are diffs, file a new JIRA or find an existing one, and skip the 
affected tests with comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}

9. When you open a PR, please attach the {{git diff xxx.sql.out}} output 
between steps 3 and 5 in the PR description with the template below:

{code}
Diff comparing to 'xxx.sql'


```diff
...  # here you put 'git diff' results
```


{code}

Note that registered UDFs all return strings, so some differences are expected.
Note that this JIRA targets plan-specific cases in general.

  was:
This JIRA targets to improve Python test coverage in particular about 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert *.sql test cases that can be affected by this rule 
{{ExtractPythonUDFs}} like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql]
 Namely most of plan related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert `udf(...)` into each statement. It is not required to add more 
combinations.
 And it is not strict about where to insert.

5. Run it below again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
{code}
6. Compare the results with the original file, {{xxx.sql}}. If there are no 
notable diffs, open a PR.

7. If there are diffs, file a new JIRA or find an existing one, and skip the 
affected tests with comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}

9. When you open a PR, please attach the {{git diff xxx.sql.out}} output 
between steps 3 and 5 in the PR description.

Note that registered UDFs all return strings, so some differences are expected.
Note that this JIRA targets plan-specific cases in general.


> Convert applicable *.sql tests into UDF integrated test base
> 
>
> Key: SPARK-27921
> URL: https://issues.apache.org/jira/browse/SPARK-27921
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA targets to improve Python test coverage in particular about 
> {{ExtractPythonUDFs}}.
>  This rule has caused many regressions or issues such as SPARK-27803, 
> SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
>  We should convert *.sql test cases that can be affected by this rule 
> {{ExtractPythonUDFs}} like 
> [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql]
>  Namely most of plan related test cases might have to be converted.
> *Here is the rough contribution guide to follow:*
> 1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}}
> 2. Keep the comments and state that this file was copied from {{xxx.sq

[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27921:
-
Description: 
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions and issues, such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
 Namely, most plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert `udf(...)` into each statement. It is not required to add more 
combinations, and it is not strict about where to insert.

5. Run it below again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
{code}
6. Compare the results with the original file, {{xxx.sql}}. If there are no 
notable diffs, open a PR.

7. If there are diffs, file a new JIRA or find an existing one, and skip the 
affected tests with comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}

9. When you open a PR, please attach the {{git diff xxx.sql.out}} output 
between steps 3 and 5 in the PR description.

Note that registered UDFs all return strings, so some differences are expected.
Note that this JIRA targets plan-specific cases in general.

  was:
This JIRA targets to improve Python test coverage in particular about 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert *.sql test cases that can be affected by this rule 
{{ExtractPythonUDFs}} like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql]
 Namely most of plan related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement. It is not required to add more 
combinations, and the exact places to insert are not strict.

5. Run the following again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
{code}
6. Compare the results with the original file, {{xxx.sql}}. If there is no 
notable diff, open a PR.

7. If there is a diff, file or find the relevant JIRA, and skip the affected 
tests with comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}

9. When you open a PR, please attach the {{git diff}} of {{xxx.sql.out}} in the 
PR description.

Note that the registered UDFs all return strings, so some differences are 
expected.
Note that this JIRA targets plan-specific cases in general.


> Convert applicable *.sql tests into UDF integrated test base
> 
>
> Key: SPARK-27921
> URL: https://issues.apache.org/jira/browse/SPARK-27921
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA aims to improve Python test coverage, in particular around 
> {{ExtractPythonUDFs}}.
> This rule has caused many regressions or issues such as SPARK-27803, 
> SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
> We should convert the *.sql test cases that can be affected by the 
> {{ExtractPythonUDFs}} rule, like 
> [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
> Namely, most of the plan-related test cases might have to be converted.
> *Here is the rough contribution guide to follow:*
> 1. Copy and paste the {{xxx.sql}} file into {{udf/udf-xxx.sql}}
> 2. Keep the comments and state that this file was copied from {{xxx.sql}}, 
> for now.
> 3. Run the following:
> {code:java}
> SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- 
> -z udf/udf-xxx.sql"

[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27921:
-
Description: 
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
Namely, most of the plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

1. Copy and paste the {{xxx.sql}} file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run the following:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement. It is not required to add more 
combinations, and the exact places to insert are not strict.

5. Run the following again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
{code}
6. Compare the results with the original file, {{xxx.sql}}. If there is no 
notable diff, open a PR.

7. If there is a diff, file or find the relevant JIRA, and skip the affected 
tests with comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}

9. When you open a PR, please attach the {{git diff}} of {{xxx.sql.out}} in the 
PR description.

Note that the registered UDFs all return strings, so some differences are 
expected.
Note that this JIRA targets plan-specific cases in general.

  was:
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
Namely, most of the plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

1. Copy and paste the {{xxx.sql}} file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run the following:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement. It is not required to add more 
combinations, and the exact places to insert are not strict.

5. Run the following again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
{code}
6. Compare the results with the original file, {{xxx.sql}}. If there is no 
notable diff, open a PR.

7. If there is a diff, file or find the relevant JIRA, and skip the affected 
tests with comments.


> Convert applicable *.sql tests into UDF integrated test base
> 
>
> Key: SPARK-27921
> URL: https://issues.apache.org/jira/browse/SPARK-27921
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA aims to improve Python test coverage, in particular around 
> {{ExtractPythonUDFs}}.
> This rule has caused many regressions or issues such as SPARK-27803, 
> SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
> We should convert the *.sql test cases that can be affected by the 
> {{ExtractPythonUDFs}} rule, like 
> [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
> Namely, most of the plan-related test cases might have to be converted.
> *Here is the rough contribution guide to follow:*
> 1. Copy and paste the {{xxx.sql}} file into {{udf/udf-xxx.sql}}
> 2. Keep the comments and state that this file was copied from {{xxx.sql}}, 
> for now.
> 3. Run the following:
> {code:java}
> SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- 
> -z udf/udf-xxx.sql"
> git add .
> {code}
> 4. Insert {{udf(...)}} into each statement. It is not required to add more 
> combinations, and the exact places to insert are not strict.
> 5. Run the following again:
> {code:java}
> SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- 
> -z udf/udf-xxx.sql"
> git diff
> {code}
> 6. Compare the results with the original file, {{xxx.sql}}. If there is no 
> notable diff, open a PR.

[jira] [Commented] (SPARK-28264) Revisiting Python / pandas UDF

2019-07-07 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879980#comment-16879980
 ] 

Sean Owen commented on SPARK-28264:
---

I generally like the rationalization of the various UDF types: they do 
different things, and that isn't obvious from the names. Anything we can do to 
clarify is a win.
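
For readers less familiar with the current variants, a minimal sketch of two of 
the existing styles being rationalized (PySpark 2.4-era API; the function 
bodies and names are made up for illustration):
{code:python}
from pyspark.sql.functions import pandas_udf, PandasUDFType

# SCALAR: element-wise, takes and returns a pandas.Series
@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(s):
    return s + 1

# GROUPED_MAP: takes and returns a whole pandas.DataFrame per group
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    return pdf.assign(v=pdf.v - pdf.v.mean())

# usage: df.withColumn('v2', plus_one(df['v'])) vs.
#        df.groupBy('id').apply(subtract_mean)
{code}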

> Revisiting Python / pandas UDF
> --
>
> Key: SPARK-28264
> URL: https://issues.apache.org/jira/browse/SPARK-28264
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> Over the past two years, pandas UDFs have been perhaps the most important 
> change to Spark for Python data science. However, these functionalities have 
> evolved organically, leading to some inconsistencies and confusion among 
> users. This document revisits UDF definition and naming, as a result of 
> discussions among Xiangrui, Li Jin, Hyukjin, and Reynold.
>  
> See document here: 
> [https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit#|https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28290) Use `SslContextFactory.Server` instead of `SslContextFactory`

2019-07-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28290:


Assignee: Apache Spark

> Use `SslContextFactory.Server` instead of `SslContextFactory`
> -
>
> Key: SPARK-28290
> URL: https://issues.apache.org/jira/browse/SPARK-28290
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> `SslContextFactory` is deprecated as of Jetty 9.4. This issue replaces it 
> with `SslContextFactory.Server`.
> - 
> https://www.eclipse.org/jetty/javadoc/9.4.19.v20190610/org/eclipse/jetty/util/ssl/SslContextFactory.html
> - 
> https://www.eclipse.org/jetty/javadoc/9.3.24.v20180605/org/eclipse/jetty/util/ssl/SslContextFactory.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28290) Use `SslContextFactory.Server` instead of `SslContextFactory`

2019-07-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28290:


Assignee: (was: Apache Spark)

> Use `SslContextFactory.Server` instead of `SslContextFactory`
> -
>
> Key: SPARK-28290
> URL: https://issues.apache.org/jira/browse/SPARK-28290
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> `SslContextFactory` is deprecated as of Jetty 9.4. This issue replaces it 
> with `SslContextFactory.Server`.
> - 
> https://www.eclipse.org/jetty/javadoc/9.4.19.v20190610/org/eclipse/jetty/util/ssl/SslContextFactory.html
> - 
> https://www.eclipse.org/jetty/javadoc/9.3.24.v20180605/org/eclipse/jetty/util/ssl/SslContextFactory.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28290) Use `SslContextFactory.Server` instead of `SslContextFactory`

2019-07-07 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-28290:
-

 Summary: Use `SslContextFactory.Server` instead of 
`SslContextFactory`
 Key: SPARK-28290
 URL: https://issues.apache.org/jira/browse/SPARK-28290
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


`SslContextFactory` is deprecated as of Jetty 9.4. This issue replaces it with 
`SslContextFactory.Server`.
- 
https://www.eclipse.org/jetty/javadoc/9.4.19.v20190610/org/eclipse/jetty/util/ssl/SslContextFactory.html
- 
https://www.eclipse.org/jetty/javadoc/9.3.24.v20180605/org/eclipse/jetty/util/ssl/SslContextFactory.html
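
As an untested sketch of the kind of change involved (not the actual Spark 
patch; the keystore path and password are placeholders):
{code:java}
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;
import org.eclipse.jetty.util.ssl.SslContextFactory;

public class TlsConnectorExample {
  public static void main(String[] args) throws Exception {
    Server server = new Server();

    // Before (deprecated as of Jetty 9.4):
    //   SslContextFactory factory = new SslContextFactory();
    // After: use the server-side specialization instead.
    SslContextFactory.Server factory = new SslContextFactory.Server();
    factory.setKeyStorePath("/path/to/keystore.jks");  // placeholder
    factory.setKeyStorePassword("changeit");           // placeholder

    server.addConnector(new ServerConnector(server, factory));
    server.start();
  }
}
{code}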



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27921:
-
Description: 
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
Namely, most of the plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

1. Copy and paste the {{xxx.sql}} file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run the following:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement. It is not required to add more 
combinations, and the exact places to insert are not strict.

5. Run the following again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
{code}
6. Compare the results with the original file, {{xxx.sql}}. If there is no 
notable diff, open a PR.

7. If there is a diff, file or find the relevant JIRA, and skip the affected 
tests with comments.

  was:
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.

We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].

Namely, most of the plan-related test cases might have to be converted.



Here is the rough contribution guide to follow:

1. Copy and paste the {{xxx.sql}} file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run the following:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement. It is not required to add more 
combinations, and the exact places to insert are not strict.

5. Run the following again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
{code}
6. Compare the results with the original file, {{xxx.sql}}. If there is no 
notable diff, open a PR.

7. If there is a diff, file or find the relevant JIRA, and skip the affected 
tests with comments.


> Convert applicable *.sql tests into UDF integrated test base
> 
>
> Key: SPARK-27921
> URL: https://issues.apache.org/jira/browse/SPARK-27921
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA aims to improve Python test coverage, in particular around 
> {{ExtractPythonUDFs}}.
> This rule has caused many regressions or issues such as SPARK-27803, 
> SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
> We should convert the *.sql test cases that can be affected by the 
> {{ExtractPythonUDFs}} rule, like 
> [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
> Namely, most of the plan-related test cases might have to be converted.
> *Here is the rough contribution guide to follow:*
> 1. Copy and paste the {{xxx.sql}} file into {{udf/udf-xxx.sql}}
> 2. Keep the comments and state that this file was copied from {{xxx.sql}}, 
> for now.
> 3. Run the following:
> {code:java}
> SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- 
> -z udf/udf-xxx.sql"
> git add .
> {code}
> 4. Insert {{udf(...)}} into each statement. It is not required to add more 
> combinations, and the exact places to insert are not strict.
> 5. Run the following again:
> {code:java}
> SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- 
> -z udf/udf-xxx.sql"
> git diff
> {code}
> 6. Compare the results with the original file, {{xxx.sql}}. If there is no 
> notable diff, open a PR.
> 7. If there is a diff, file or find the relevant JIRA, and skip the affected 
> tests with comments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27921:
-
Description: 
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.

We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].

Namely, most of the plan-related test cases might have to be converted.

Here is the rough contribution guide to follow:

1. Copy and paste the {{xxx.sql}} file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run the following:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement. It is not required to add more 
combinations, and the exact places to insert are not strict.

5. Run the following again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
{code}
6. Compare the results with the original file, {{xxx.sql}}. If there is no 
notable diff, open a PR.

7. If there is a diff, file or find the relevant JIRA, and skip the affected 
tests with comments.

  was:
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.

We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql

Namely, most of the plan-related test cases might have to be converted.



> Convert applicable *.sql tests into UDF integrated test base
> 
>
> Key: SPARK-27921
> URL: https://issues.apache.org/jira/browse/SPARK-27921
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA aims to improve Python test coverage, in particular around 
> {{ExtractPythonUDFs}}.
> This rule has caused many regressions or issues such as SPARK-27803, 
> SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
> We should convert the *.sql test cases that can be affected by the 
> {{ExtractPythonUDFs}} rule, like 
> [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
> Namely, most of the plan-related test cases might have to be converted.
> Here is the rough contribution guide to follow:
> 1. Copy and paste the {{xxx.sql}} file into {{udf/udf-xxx.sql}}
> 2. Keep the comments and state that this file was copied from {{xxx.sql}}, 
> for now.
> 3. Run the following:
> {code:java}
> SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- 
> -z udf/udf-xxx.sql"
> git add .
> {code}
> 4. Insert {{udf(...)}} into each statement. It is not required to add more 
> combinations, and the exact places to insert are not strict.
> 5. Run the following again:
> {code:java}
> SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- 
> -z udf/udf-xxx.sql"
> git diff
> {code}
> 6. Compare the results with the original file, {{xxx.sql}}. If there is no 
> notable diff, open a PR.
> 7. If there is a diff, file or find the relevant JIRA, and skip the affected 
> tests with comments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27921:
-
Description: 
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.

We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].

Namely, most of the plan-related test cases might have to be converted.



Here is the rough contribution guide to follow:

1. Copy and paste the {{xxx.sql}} file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run the following:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement. It is not required to add more 
combinations, and the exact places to insert are not strict.

5. Run the following again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
{code}
6. Compare the results with the original file, {{xxx.sql}}. If there is no 
notable diff, open a PR.

7. If there is a diff, file or find the relevant JIRA, and skip the affected 
tests with comments.

  was:
This JIRA aims to improve Python test coverage, in particular around 
{{ExtractPythonUDFs}}.
This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.

We should convert the *.sql test cases that can be affected by the 
{{ExtractPythonUDFs}} rule, like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].

Namely, most of the plan-related test cases might have to be converted.

Here is the rough contribution guide to follow:

1. Copy and paste the {{xxx.sql}} file into {{udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from {{xxx.sql}}, for 
now.

3. Run the following:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement. It is not required to add more 
combinations, and the exact places to insert are not strict.

5. Run the following again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
{code}
6. Compare the results with the original file, {{xxx.sql}}. If there is no 
notable diff, open a PR.

7. If there is a diff, file or find the relevant JIRA, and skip the affected 
tests with comments.


> Convert applicable *.sql tests into UDF integrated test base
> 
>
> Key: SPARK-27921
> URL: https://issues.apache.org/jira/browse/SPARK-27921
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA aims to improve Python test coverage, in particular around 
> {{ExtractPythonUDFs}}.
> This rule has caused many regressions or issues such as SPARK-27803, 
> SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
> We should convert the *.sql test cases that can be affected by the 
> {{ExtractPythonUDFs}} rule, like 
> [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql].
> Namely, most of the plan-related test cases might have to be converted.
> Here is the rough contribution guide to follow:
> 1. Copy and paste the {{xxx.sql}} file into {{udf/udf-xxx.sql}}
> 2. Keep the comments and state that this file was copied from {{xxx.sql}}, 
> for now.
> 3. Run the following:
> {code:java}
> SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- 
> -z udf/udf-xxx.sql"
> git add .
> {code}
> 4. Insert {{udf(...)}} into each statement. It is not required to add more 
> combinations, and the exact places to insert are not strict.
> 5. Run the following again:
> {code:java}
> SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- 
> -z udf/udf-xxx.sql"
> git diff
> {code}
> 6. Compare the results with the original file, {{xxx.sql}}. If there is no 
> notable diff, open a PR.
> 7. If there is a diff, file or find the relevant JIRA, and skip the affected 
> tests with comments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28288) Convert and port 'window.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28288:


 Summary: Convert and port 'window.sql' into UDF test base
 Key: SPARK-28288
 URL: https://issues.apache.org/jira/browse/SPARK-28288
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28289) Convert and port 'union.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28289:


 Summary: Convert and port 'union.sql' into UDF test base
 Key: SPARK-28289
 URL: https://issues.apache.org/jira/browse/SPARK-28289
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28285) Convert and port 'outer-join.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28285:


 Summary: Convert and port 'outer-join.sql' into UDF test base
 Key: SPARK-28285
 URL: https://issues.apache.org/jira/browse/SPARK-28285
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28283) Convert and port 'intersect-all.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28283:


 Summary: Convert and port 'intersect-all.sql' into UDF test base
 Key: SPARK-28283
 URL: https://issues.apache.org/jira/browse/SPARK-28283
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28286) Convert and port 'pivot.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28286:


 Summary: Convert and port 'pivot.sql' into UDF test base
 Key: SPARK-28286
 URL: https://issues.apache.org/jira/browse/SPARK-28286
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28287) Convert and port 'udaf.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28287:


 Summary: Convert and port 'udaf.sql' into UDF test base
 Key: SPARK-28287
 URL: https://issues.apache.org/jira/browse/SPARK-28287
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28284) Convert and port 'join-empty-relation.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28284:


 Summary: Convert and port 'join-empty-relation.sql' into UDF test 
base
 Key: SPARK-28284
 URL: https://issues.apache.org/jira/browse/SPARK-28284
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28280) Convert and port 'group-by.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28280:


 Summary: Convert and port 'group-by.sql' into UDF test base
 Key: SPARK-28280
 URL: https://issues.apache.org/jira/browse/SPARK-28280
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28282) Convert and port 'inline-table.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28282:


 Summary: Convert and port 'inline-table.sql' into UDF test base
 Key: SPARK-28282
 URL: https://issues.apache.org/jira/browse/SPARK-28282
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28281) Convert and port 'having.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28281:


 Summary: Convert and port 'having.sql' into UDF test base
 Key: SPARK-28281
 URL: https://issues.apache.org/jira/browse/SPARK-28281
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28279) Convert and port 'group-analysis.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28279:


 Summary: Convert and port 'group-analysis.sql' into UDF test base
 Key: SPARK-28279
 URL: https://issues.apache.org/jira/browse/SPARK-28279
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28277) Convert and port 'except.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28277:


 Summary: Convert and port 'except.sql' into UDF test base
 Key: SPARK-28277
 URL: https://issues.apache.org/jira/browse/SPARK-28277
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28276) Convert and port 'cross-join.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28276:


 Summary: Convert and port 'cross-join.sql' into UDF test base
 Key: SPARK-28276
 URL: https://issues.apache.org/jira/browse/SPARK-28276
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28278) Convert and port 'except-all.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28278:


 Summary: Convert and port 'except-all.sql' into UDF test base
 Key: SPARK-28278
 URL: https://issues.apache.org/jira/browse/SPARK-28278
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28275) Convert and port 'count.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28275:


 Summary: Convert and port 'count.sql' into UDF test base
 Key: SPARK-28275
 URL: https://issues.apache.org/jira/browse/SPARK-28275
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28271) Convert and port 'pgSQL/aggregates_part2.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28271:
-
Summary: Convert and port 'pgSQL/aggregates_part2.sql' into UDF test base  
(was: Convert and port 'aggregates_part2.sql' into UDF test base)

> Convert and port 'pgSQL/aggregates_part2.sql' into UDF test base
> 
>
> Key: SPARK-28271
> URL: https://issues.apache.org/jira/browse/SPARK-28271
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> see SPARK-27883



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28272) Convert and port 'pgSQL/aggregates_part3.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28272:
-
Summary: Convert and port 'pgSQL/aggregates_part3.sql' into UDF test base  
(was: Convert and port 'aggregates_part3.sql' into UDF test base)

> Convert and port 'pgSQL/aggregates_part3.sql' into UDF test base
> 
>
> Key: SPARK-28272
> URL: https://issues.apache.org/jira/browse/SPARK-28272
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> see SPARK-27988



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28274) Convert and port 'pgSQL/window.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28274:
-
Summary: Convert and port 'pgSQL/window.sql' into UDF test base  (was: 
Convert and port 'window.sql' into UDF test base)

> Convert and port 'pgSQL/window.sql' into UDF test base
> --
>
> Key: SPARK-28274
> URL: https://issues.apache.org/jira/browse/SPARK-28274
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See SPARK-23160



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28270) Convert and port 'pgSQL/aggregates_part1.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28270:
-
Summary: Convert and port 'pgSQL/aggregates_part1.sql' into UDF test base  
(was: Convert and port 'aggregates_part1.sql' into UDF test base)

> Convert and port 'pgSQL/aggregates_part1.sql' into UDF test base
> 
>
> Key: SPARK-28270
> URL: https://issues.apache.org/jira/browse/SPARK-28270
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> see SPARK-27770



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28273) Convert and port 'pgSQL/case.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28273:
-
Summary: Convert and port 'pgSQL/case.sql' into UDF test base  (was: 
Convert and port 'case.sql' into UDF test base)

> Convert and port 'pgSQL/case.sql' into UDF test base
> 
>
> Key: SPARK-28273
> URL: https://issues.apache.org/jira/browse/SPARK-28273
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See SPARK-27934



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28274) Convert and port 'window.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28274:


 Summary: Convert and port 'window.sql' into UDF test base
 Key: SPARK-28274
 URL: https://issues.apache.org/jira/browse/SPARK-28274
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


See SPARK-23160



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28273) Convert and port 'case.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28273:


 Summary: Convert and port 'case.sql' into UDF test base
 Key: SPARK-28273
 URL: https://issues.apache.org/jira/browse/SPARK-28273
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


See SPARK-27934



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28272) Convert and port 'aggregates_part3.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28272:


 Summary: Convert and port 'aggregates_part3.sql' into UDF test base
 Key: SPARK-28272
 URL: https://issues.apache.org/jira/browse/SPARK-28272
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


see SPARK-27988



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28270) Convert and port 'aggregates_part1.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28270:
-
Description: see SPARK-27770

> Convert and port 'aggregates_part1.sql' into UDF test base
> --
>
> Key: SPARK-28270
> URL: https://issues.apache.org/jira/browse/SPARK-28270
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> see SPARK-27770



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28271) Convert and port 'aggregates_part2.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28271:
-
Description: see SPARK-27883

> Convert and port 'aggregates_part2.sql' into UDF test base
> --
>
> Key: SPARK-28271
> URL: https://issues.apache.org/jira/browse/SPARK-28271
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> see SPARK-27883



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28271) Convert and port 'aggregates_part2.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28271:


 Summary: Convert and port 'aggregates_part2.sql' into UDF test base
 Key: SPARK-28271
 URL: https://issues.apache.org/jira/browse/SPARK-28271
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28270) Convert and port 'aggregates_part1.sql' into UDF test base

2019-07-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28270:


 Summary: Convert and port 'aggregates_part1.sql' into UDF test base
 Key: SPARK-28270
 URL: https://issues.apache.org/jira/browse/SPARK-28270
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28269) ArrowStreamPandasSerializer gets stuck

2019-07-07 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879975#comment-16879975
 ] 

Hyukjin Kwon commented on SPARK-28269:
--

Seems like I can't open the image. Would you be able to specify the Pandas, 
PyArrow, and Python versions, and provide a full reproducer if possible?
It would also be better to just show the error message with a stacktrace (not 
an image).

> ArrowStreamPandasSerializer gets stuck
> -
>
> Key: SPARK-28269
> URL: https://issues.apache.org/jira/browse/SPARK-28269
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Modi Tamam
>Priority: Major
> Attachments: Untitled.xcf
>
>
> I'm working with Pyspark version 2.4.3.
> I have a big data frame:
>  * ~15M rows
>  * ~130 columns
>  * ~2.5 GB - I've converted it to a Pandas data frame; pickling it 
> (pandas_df.to_pickle(...)) resulted in a file of size 2.5 GB.
> I have some code that groups this data frame and applies a Pandas UDF:
>  
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql import functions as F
> import pyarrow.parquet as pq
> import pyarrow as pa
> non_issued_patch="31.7996378000_35.2114362000"
> issued_patch = "31.7995787833_35.2121463045"
> @pandas_udf("patch_name string", PandasUDFType.GROUPED_MAP)
> def foo(pdf):
>  import pandas as pd
>  ret_val = pd.DataFrame({'patch_name': [pdf['patch_name'].iloc[0]]})
>  return ret_val
> full_df=spark.read.parquet('debug-mega-patch')
> df = full_df.filter(F.col("grouping_column") == issued_patch).cache()
> df.groupBy("grouping_column").apply(foo).repartition(1).write.mode('overwrite').parquet('debug-df/')
>  
> {code}
>  
> The above code gets stuck in the ArrowStreamPandasSerializer (on the first 
> line, when reading a batch from the reader):
>  
> {code:java}
> for batch in reader:
>  yield [self.arrow_to_pandas(c) for c in  
> pa.Table.from_batches([batch]).itercolumns()]{code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28255) Upgrade dependencies with vulnerabilities

2019-07-07 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879973#comment-16879973
 ] 

Hyukjin Kwon commented on SPARK-28255:
--

[~bozho], so what do you suggest we upgrade? Users can already use a higher 
Hadoop version. Py4J seems fine - please manually check what the issue is, and 
report it to Py4J.
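
For example, a newer Hadoop can be selected at build time; a sketch (the exact 
profile names depend on the Spark version):
{code:java}
./build/mvn -Pyarn -Phadoop-3.2 -DskipTests clean package
{code}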

> Upgrade dependencies with vulnerabilities
> -
>
> Key: SPARK-28255
> URL: https://issues.apache.org/jira/browse/SPARK-28255
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Bozhidar Bozhanov
>Priority: Major
>
> There are severe vulnerabilities in two dependencies:
>  
> [ERROR] hadoop-mapreduce-client-core-2.7.3.jar: CVE-2018-8029, CVE-2016-6811
> [ERROR] py4j-0.10.8.1.jar: CVE-2016-5636, CVE-2008-1887



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28255) Upgrade dependencies with vulnerabilities

2019-07-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28255.
--
Resolution: Invalid

> Upgrade dependencies with vulnerabilities
> -
>
> Key: SPARK-28255
> URL: https://issues.apache.org/jira/browse/SPARK-28255
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Bozhidar Bozhanov
>Priority: Major
>
> There are severe vulnerabilities in two dependencies:
>  
> [ERROR] hadoop-mapreduce-client-core-2.7.3.jar: CVE-2018-8029, CVE-2016-6811
> [ERROR] py4j-0.10.8.1.jar: CVE-2016-5636, CVE-2008-1887



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28189) Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables

2019-07-07 Thread Luke Chu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Chu updated SPARK-28189:
-
Description: 
Column names in general are case insensitive in Pyspark, and df.drop() in 
general is also case insensitive.

However, when referring to an upstream table, such as from a join, e.g.
{code:java}
vals1 = [('Pirate', 1),('Monkey', 2),('Ninja', 3),('Spaghetti', 4)]
df1 = spark.createDataFrame(vals1, ['KEY','field'])

vals2 = [('Rutabaga', 1),('Pirate', 2),('Ninja', 3),('Darth Vader', 4)]
df2 = spark.createDataFrame(vals2, ['KEY','CAPS'])


df_joined = df1.join(df2, df1['key'] == df2['key'], "left")
{code}
 

drop becomes case sensitive, e.g.
{code:java}
# from above, df1 consists of columns ['KEY', 'field']
# from above, df2 consists of columns ['KEY', 'CAPS']

df_joined.select(df2['key']) # will give a result
df_joined.drop('caps') # will also give a result
{code}
However, note the following:
{code:java}
df_joined.drop(df2['key']) # no-op
df_joined.drop(df2['caps']) # no-op

df_joined.drop(df2['KEY']) # will drop column as expected
df_joined.drop(df2['CAPS']) # will drop column as expected

{code}
 

 

So, in summary, using df.drop(df2['col']) doesn't align with the expected case 
insensitivity for column names, even though select, join, and dropping a 
column by name generally are case insensitive.
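
A minimal workaround sketch, reusing the frames above - dropping by the plain 
column name keeps the case-insensitive behaviour, and an exact-case Column 
reference also works:
{code:python}
df_joined.drop('caps')       # by name: case insensitive, drops df2's CAPS
df_joined.drop(df2['CAPS'])  # by Column with the exact case: also works
{code}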

 

  was:
Column names in general are case insensitive in Pyspark, and df.drop() in 
general is also case insensitive.

However, when referring to an upstream table, such as from a join, e.g.
{code:java}
vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
df1 = spark.createDataFrame(vals1, ['KEY','field'])

vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
df2 = spark.createDataFrame(vals2, ['KEY','CAPS'])


df_joined = df1.join(df2, df1['key'] == df2['key'], "left")
{code}
 

drop becomes case sensitive, e.g.
{code:java}
# from above, df1 consists of columns ['KEY', 'field']
# from above, df2 consists of columns ['KEY', 'CAPS']

df_joined.select(df2['key']) # will give a result
df_joined.drop('caps') # will also give a result
{code}
However, note the following:
{code:java}
df_joined.drop(df2['key']) # no-op
df_joined.drop(df2['caps']) # no-op

df_joined.drop(df2['KEY']) # will drop column as expected
df_joined.drop(df2['CAPS']) # will drop column as expected

{code}
 

 

So, in summary, using df.drop(df2['col']) doesn't align with the expected case 
insensitivity for column names, even though select, join, and dropping a 
column by name generally are case insensitive.

 


> Pyspark  - df.drop() is Case Sensitive when Referring to Upstream Tables
> 
>
> Key: SPARK-28189
> URL: https://issues.apache.org/jira/browse/SPARK-28189
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Luke Chu
>Assignee: Tony Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>
> Column names in general are case insensitive in Pyspark, and df.drop() in 
> general is also case insensitive.
> However, when referring to an upstream table, such as from a join, e.g.
> {code:java}
> vals1 = [('Pirate', 1),('Monkey', 2),('Ninja', 3),('Spaghetti', 4)]
> df1 = spark.createDataFrame(vals1, ['KEY','field'])
> vals2 = [('Rutabaga', 1),('Pirate', 2),('Ninja', 3),('Darth Vader', 4)]
> df2 = spark.createDataFrame(vals2, ['KEY','CAPS'])
> df_joined = df1.join(df2, df1['key'] == df2['key'], "left")
> {code}
>  
> drop becomes case sensitive, e.g.
> {code:java}
> # from above, df1 consists of columns ['KEY', 'field']
> # from above, df2 consists of columns ['KEY', 'CAPS']
> df_joined.select(df2['key']) # will give a result
> df_joined.drop('caps') # will also give a result
> {code}
> However, note the following:
> {code:java}
> df_joined.drop(df2['key']) # no-op
> df_joined.drop(df2['caps']) # no-op
> df_joined.drop(df2['KEY']) # will drop column as expected
> df_joined.drop(df2['CAPS']) # will drop column as expected
> {code}
>  
>  
> So, in summary, using df.drop(df2['col']) doesn't align with the expected 
> case insensitivity for column names, even though select, join, and dropping 
> a column by name generally are case insensitive.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28266) data correctness issue: data duplication when `path` serde property is present

2019-07-07 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-28266:
---
Labels: correctness  (was: )

> data correctness issue: data duplication when `path` serde property is 
> present
> ---
>
> Key: SPARK-28266
> URL: https://issues.apache.org/jira/browse/SPARK-28266
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 
> 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: correctness
>
> Spark duplicates returned datasets when the `path` serde property is present 
> in a parquet table.
> Confirmed versions affected: Spark 2.2, Spark 2.3, Spark 2.4.
> Confirmed unaffected versions: Spark 2.1 and earlier (tested with Spark 1.6 
> at least).
> Reproducer:
> {code:python}
> >>> spark.sql("create table ruslan_test.test55 as select 1 as id")
> DataFrame[]
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> HiveTableScan [id#16], HiveTableRelation `ruslan_test`.`test55`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#16]
> >>> spark.table("ruslan_test.test55").count()
> 1
> {code}
> (all is good at this point; now exit the session and run, for example in Hive:)
> {code:sql}
> ALTER TABLE ruslan_test.test55 SET SERDEPROPERTIES ( 
> 'path'='hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55' )
> {code}
> So LOCATION and serde `path` property would point to the same location.
> Now the count returns two records instead of one:
> {code:python}
> >>> spark.table("ruslan_test.test55").count()
> 2
> >>> spark.table("ruslan_test.test55").explain()
> == Physical Plan ==
> *(1) FileScan parquet ruslan_test.test55[id#9] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct
> >>>
> {code}
> Also notice that the presence of the `path` serde property makes the table 
> location show up twice:
> {quote}
> InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, 
> hdfs://epsdatalake/hive..., 
> {quote}
> We have some applications that create parquet tables in Hive with the `path` 
> serde property, and it duplicates data in query results.
> Hive, Impala, etc., and Spark 2.1 and earlier read such tables fine, but 
> Spark 2.2 and later releases do not.
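> A quick way to check whether a table carries the serde property (a sketch; 
> the storage/serde properties section of the output lists `path` if present):
> {code:sql}
> DESCRIBE FORMATTED ruslan_test.test55;
> {code}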



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28269) ArrowStreamPandasSerializer gets stuck

2019-07-07 Thread Modi Tamam (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Modi Tamam updated SPARK-28269:
---
Attachment: Untitled.xcf

> ArrowStreamPandasSerializer gets stuck
> -
>
> Key: SPARK-28269
> URL: https://issues.apache.org/jira/browse/SPARK-28269
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Modi Tamam
>Priority: Major
> Attachments: Untitled.xcf
>
>
> I'm working with Pyspark version 2.4.3.
> I have a big data frame:
>  * ~15M rows
>  * ~130 columns
>  * ~2.5 GB - I've converted it to a Pandas data frame; pickling it 
> (pandas_df.to_pickle(...)) resulted in a file of size 2.5 GB.
> I have some code that groups this data frame and applies a Pandas UDF:
>  
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql import functions as F
> import pyarrow.parquet as pq
> import pyarrow as pa
> non_issued_patch="31.7996378000_35.2114362000"
> issued_patch = "31.7995787833_35.2121463045"
> @pandas_udf("patch_name string", PandasUDFType.GROUPED_MAP)
> def foo(pdf):
>  import pandas as pd
>  ret_val = pd.DataFrame({'patch_name': [pdf['patch_name'].iloc[0]]})
>  return ret_val
> full_df=spark.read.parquet('debug-mega-patch')
> df = full_df.filter(F.col("grouping_column") == issued_patch).cache()
> df.groupBy("grouping_column").apply(foo).repartition(1).write.mode('overwrite').parquet('debug-df/')
>  
> {code}
>  
> The above code gets stuck in the ArrowStreamPandasSerializer (on the first 
> line, when reading a batch from the reader):
>  
> {code:java}
> for batch in reader:
>     yield [self.arrow_to_pandas(c)
>            for c in pa.Table.from_batches([batch]).itercolumns()]{code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28269) ArrowStreamPandasSerializer gets stuck

2019-07-07 Thread Modi Tamam (JIRA)
Modi Tamam created SPARK-28269:
--

 Summary: ArrowStreamPandasSerializer gets stuck
 Key: SPARK-28269
 URL: https://issues.apache.org/jira/browse/SPARK-28269
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.3
Reporter: Modi Tamam


I'm working with PySpark version 2.4.3.

I have a big data frame:
 * ~15M rows
 * ~130 columns
 * ~2.5 GB - I've converted it to a pandas data frame; pickling it 
(pandas_df.to_pickle()) resulted in a file of size 2.5 GB.

I have some code that groups this data frame and applies a Pandas UDF:

 
{code:java}
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql import functions as F
import pyarrow.parquet as pq
import pyarrow as pa
non_issued_patch="31.7996378000_35.2114362000"
issued_patch = "31.7995787833_35.2121463045"

@pandas_udf("patch_name string", PandasUDFType.GROUPED_MAP)
def foo(pdf):
    import pandas as pd
    ret_val = pd.DataFrame({'patch_name': [pdf['patch_name'].iloc[0]]})
    return ret_val

full_df=spark.read.parquet('debug-mega-patch')
df = full_df.filter(F.col("grouping_column") == issued_patch).cache()

df.groupBy("grouping_column").apply(foo).repartition(1).write.mode('overwrite').parquet('debug-df/')
 
{code}
 

The above code gets stuck in the ArrowStreamPandasSerializer (on the first 
line, when reading a batch from the reader):

 
{code:java}
for batch in reader:
    yield [self.arrow_to_pandas(c)
           for c in pa.Table.from_batches([batch]).itercolumns()]{code}
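
Worth checking before assuming a serializer bug (a diagnostic sketch, not a confirmed cause): a GROUPED_MAP Pandas UDF materializes each group as one whole pandas DataFrame on an executor, so a single very large group can look like a hang at exactly this read loop. Group sizes can be inspected up front:

{code:python}
# Diagnostic sketch: look for skewed groups before blaming the serializer.
# A GROUPED_MAP Pandas UDF must hold an entire group in executor memory,
# so one huge group can stall in the Arrow read loop shown above.
from pyspark.sql import functions as F

(df.groupBy("grouping_column")
   .count()
   .orderBy(F.desc("count"))
   .show(5))  # largest groups first; ~15M rows in one group would explain a stall
{code}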
 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24077) Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`

2019-07-07 Thread HondaWei (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879858#comment-16879858
 ] 

HondaWei commented on SPARK-24077:
--

Hi [~hyukjin.kwon]

Thank you! I'll trace the code and work on a fix in the near term if 
[~benedict jin] isn't working on it.

 

> Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`
> --
>
> Key: SPARK-24077
> URL: https://issues.apache.org/jira/browse/SPARK-24077
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Benedict Jin
>Priority: Major
>  Labels: starter
>
> The error message of {{CREATE TEMPORARY FUNCTION IF NOT EXISTS}} looks 
> confusing: 
> {code}
> scala> 
> org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE
>  TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'")
> {code}
> {code}
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'NOT' expecting {'.', 'AS'}(line 1, pos 29)
> == SQL ==
>  CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 
> 'org.apache.spark.sql.hive.udf.YuZhouWan'
>  -^^^
>  at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>  at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>  at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
>  ... 48 elided
> {code}
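
Until the parser either accepts the clause or produces a clearer message, the usual workaround is to guard the CREATE on the driver side. A minimal sketch, reusing the function name and UDF class from the description (the helper name is made up):

{code:python}
# Sketch: emulate IF NOT EXISTS for a temporary function, since the SQL
# parser rejects the clause outright as shown above.
def create_temp_function_if_absent(spark, name, class_name):
    existing = {f.name for f in spark.catalog.listFunctions()}
    if name not in existing:
        spark.sql("CREATE TEMPORARY FUNCTION {} AS '{}'".format(name, class_name))

create_temp_function_if_absent(
    spark, "yuzhouwan", "org.apache.spark.sql.hive.udf.YuZhouWan")
{code}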



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28268) Rewrite non-correlated Semi/Anti join as Filter

2019-07-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28268:


Assignee: (was: Apache Spark)

> Rewrite non-correlated Semi/Anti join as Filter
> ---
>
> Key: SPARK-28268
> URL: https://issues.apache.org/jira/browse/SPARK-28268
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mingcong Han
>Priority: Major
>
> When a semi/anti join has a non-correlated join condition, we can convert it 
> into a Filter with a non-correlated Exists subquery. As the Exists subquery 
> is non-correlated, we can execute it with its own physical plan and avoid the 
> join altogether.
>  Actually, this optimization is mainly for the non-correlated subqueries 
> (Exists/In). We currently rewrite Exists/InSubquery as a semi/anti/existential 
> join whether it is correlated or not, and these are mostly executed with a 
> BroadcastNestedLoopJoin, which is really not a good choice.
> Here are some examples:
>  1.
> {code:sql}
> SELECT t1a
> FROM t1
> SEMI JOIN t2
> ON t2a > 10 OR t2b = 'a'
> {code}
> =>
> {code:sql}
> SELECT t1a
> FROM t1
> WHERE EXISTS(SELECT 1 
>  FROM t2 
>  WHERE t2a > 10 OR t2b = 'a')
> {code}
> 2.
> {code:sql}
> SELECT t1a
> FROM  t1
> ANTI JOIN t2
> ON t1b > 10 AND t2b = 'b'
> {code}
> =>
> {code:sql}
> SELECT t1a
> FROM t1
> WHERE NOT(t1b > 10 
>   AND EXISTS(SELECT 1
>  FROM  t2
>  WHERE t2b = 'b'))
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28268) Rewrite non-correlated Semi/Anti join as Filter

2019-07-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28268:


Assignee: Apache Spark

> Rewrite non-correlated Semi/Anti join as Filter
> ---
>
> Key: SPARK-28268
> URL: https://issues.apache.org/jira/browse/SPARK-28268
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mingcong Han
>Assignee: Apache Spark
>Priority: Major
>
> When a semi/anti join has a non-correlated join condition, we can convert it 
> into a Filter with a non-correlated Exists subquery. As the Exists subquery 
> is non-correlated, we can execute it with its own physical plan and avoid the 
> join altogether.
>  Actually, this optimization is mainly for the non-correlated subqueries 
> (Exists/In). We currently rewrite Exists/InSubquery as a semi/anti/existential 
> join whether it is correlated or not, and these are mostly executed with a 
> BroadcastNestedLoopJoin, which is really not a good choice.
> Here are some examples:
>  1.
> {code:sql}
> SELECT t1a
> FROM t1
> SEMI JOIN t2
> ON t2a > 10 OR t2b = 'a'
> {code}
> =>
> {code:sql}
> SELECT t1a
> FROM t1
> WHERE EXISTS(SELECT 1 
>  FROM t2 
>  WHERE t2a > 10 OR t2b = 'a')
> {code}
> 2.
> {code:sql}
> SELECT t1a
> FROM  t1
> ANTI JOIN t2
> ON t1b > 10 AND t2b = 'b'
> {code}
> =>
> {code:sql}
> SELECT t1a
> FROM t1
> WHERE NOT(t1b > 10 
>   AND EXISTS(SELECT 1
>  FROM  t2
>  WHERE t2b = 'b'))
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28268) Rewrite non-correlated Semi/Anti join as Filter

2019-07-07 Thread Mingcong Han (JIRA)
Mingcong Han created SPARK-28268:


 Summary: Rewrite non-correlated Semi/Anti join as Filter
 Key: SPARK-28268
 URL: https://issues.apache.org/jira/browse/SPARK-28268
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Mingcong Han


When a semi/anti join has a non-correlated join condition, we can convert it 
into a Filter with a non-correlated Exists subquery. As the Exists subquery is 
non-correlated, we can execute it with its own physical plan and avoid the 
join altogether.
 Actually, this optimization is mainly for the non-correlated subqueries 
(Exists/In). We currently rewrite Exists/InSubquery as a semi/anti/existential 
join whether it is correlated or not, and these are mostly executed with a 
BroadcastNestedLoopJoin, which is really not a good choice.

Here are some examples:
 1.
{code:sql}
SELECT t1a
FROM t1
SEMI JOIN t2
ON t2a > 10 OR t2b = 'a'
{code}
=>
{code:sql}
SELECT t1a
FROM t1
WHERE EXISTS(SELECT 1 
 FROM t2 
 WHERE t2a > 10 OR t2b = 'a')
{code}
2.
{code:sql}
SELECT t1a
FROM  t1
ANTI JOIN t2
ON t1b > 10 AND t2b = 'b'
{code}
=>
{code:sql}
SELECT t1a
FROM t1
WHERE NOT(t1b > 10 
  AND EXISTS(SELECT 1
 FROM  t2
 WHERE t2b = 'b'))
{code}
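
For context on the BroadcastNestedLoopJoin claim, a minimal repro sketch (the tables and data here are made up for illustration, matching the column names above):

{code:python}
# Sketch: a non-correlated EXISTS is currently rewritten to a semi join
# with a non-equi condition, which falls back to BroadcastNestedLoopJoin.
spark.range(100).selectExpr("id AS t1a", "id AS t1b") \
    .createOrReplaceTempView("t1")
spark.range(100).selectExpr("id AS t2a", "CAST(id AS STRING) AS t2b") \
    .createOrReplaceTempView("t2")

spark.sql("""
    SELECT t1a FROM t1
    WHERE EXISTS (SELECT 1 FROM t2 WHERE t2a > 10 OR t2b = 'a')
""").explain()  # expect BroadcastNestedLoopJoin ... LeftSemi in the plan
{code}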
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28267) Update building-spark.md

2019-07-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28267:


Assignee: Apache Spark

> Update building-spark.md
> 
>
> Key: SPARK-28267
> URL: https://issues.apache.org/jira/browse/SPARK-28267
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28267) Update building-spark.md

2019-07-07 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28267:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-23710

> Update building-spark.md
> 
>
> Key: SPARK-28267
> URL: https://issues.apache.org/jira/browse/SPARK-28267
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28267) Update building-spark.md

2019-07-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28267:


Assignee: (was: Apache Spark)

> Update building-spark.md
> 
>
> Key: SPARK-28267
> URL: https://issues.apache.org/jira/browse/SPARK-28267
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org