[jira] [Commented] (SPARK-28293) Implement Spark's own GetTableTypesOperation
[ https://issues.apache.org/jira/browse/SPARK-28293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880069#comment-16880069 ] Yuming Wang commented on SPARK-28293: - I'm working on > Implement Spark's own GetTableTypesOperation > > > Key: SPARK-28293 > URL: https://issues.apache.org/jira/browse/SPARK-28293 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: Hive-1.2.1.png, Hive-2.3.5.png > > > Build with Hive 1.2.1: > !Hive-1.2.1.png! > Build with Hive 2.3.5: > !Hive-2.3.5.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28293) Implement Spark's own GetTableTypesOperation
[ https://issues.apache.org/jira/browse/SPARK-28293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28293: Description: Build with Hive 1.2.1: !Hive-1.2.1.png! Build with Hive 2.3.5: !Hive-2.3.5.png! was: Build with Hive 1.2.1: !image-2019-07-08-14-50-01-831.png! Build with Hive 2.3.5: !image-2019-07-08-14-52-48-963.png!
[jira] [Updated] (SPARK-28293) Implement Spark's own GetTableTypesOperation
[ https://issues.apache.org/jira/browse/SPARK-28293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28293: Attachment: Hive-1.2.1.png
[jira] [Created] (SPARK-28294) Support `spark.history.fs.cleaner.maxNum` configuration
Dongjoon Hyun created SPARK-28294: - Summary: Support `spark.history.fs.cleaner.maxNum` configuration Key: SPARK-28294 URL: https://issues.apache.org/jira/browse/SPARK-28294 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Dongjoon Hyun Up to now, Apache Spark maintains the event log directory by a time-based policy, `spark.history.fs.cleaner.maxAge`. However, there are two issues. 1. Some file systems have a limitation on the maximum number of files in a single directory. For example, HDFS `dfs.namenode.fs-limits.max-directory-items` is 1024 * 1024 by default. - https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml 2. Spark is sometimes unable to clean up some old log files due to permission issues. To handle both (1) and (2), this issue aims to support an additional count-based policy configuration for the event log directory, `spark.history.fs.cleaner.maxNum`. Spark can try to keep the number of files in the event log directory according to this policy.
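The count-based retention policy proposed above can be sketched in a few lines: sort event logs newest-first and mark everything beyond `maxNum` for deletion. This is an illustrative sketch in plain Python; the function name and signature are assumptions, not Spark's actual implementation (which would also need to combine this with the existing `maxAge` time policy).

```python
import os


def select_logs_to_delete(log_dir, max_num):
    """Keep at most max_num newest event logs; return the rest, oldest last.

    A minimal sketch of a count-based retention policy. The name
    select_logs_to_delete and the parameter max_num are illustrative,
    not part of Spark's API.
    """
    entries = [os.path.join(log_dir, name) for name in os.listdir(log_dir)]
    # Sort newest first by modification time; everything past the first
    # max_num entries is the overflow to delete.
    entries.sort(key=os.path.getmtime, reverse=True)
    return entries[max_num:]
```

A cleaner built on this would attempt to delete the returned paths, tolerating the permission failures mentioned in issue (2).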
[jira] [Updated] (SPARK-28293) Implement Spark's own GetTableTypesOperation
[ https://issues.apache.org/jira/browse/SPARK-28293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28293: Attachment: Hive-2.3.5.png
[jira] [Created] (SPARK-28293) Implement Spark's own GetTableTypesOperation
Yuming Wang created SPARK-28293: --- Summary: Implement Spark's own GetTableTypesOperation Key: SPARK-28293 URL: https://issues.apache.org/jira/browse/SPARK-28293 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang Build with Hive 1.2.1: !image-2019-07-08-14-50-01-831.png! Build with Hive 2.3.5: !image-2019-07-08-14-52-48-963.png!
[jira] [Assigned] (SPARK-28292) Enable inject user-defined Hint
[ https://issues.apache.org/jira/browse/SPARK-28292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28292: Assignee: (was: Apache Spark) > Enable inject user-defined Hint > --- > > Key: SPARK-28292 > URL: https://issues.apache.org/jira/browse/SPARK-28292 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: angerszhu >Priority: Major > > We can't inject hints into the Analyzer; we hope to add an extension entry point to inject > user-defined hints
[jira] [Assigned] (SPARK-28292) Enable inject user-defined Hint
[ https://issues.apache.org/jira/browse/SPARK-28292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28292: Assignee: Apache Spark
[jira] [Created] (SPARK-28292) Enable inject user-defined Hint
angerszhu created SPARK-28292: - Summary: Enable inject user-defined Hint Key: SPARK-28292 URL: https://issues.apache.org/jira/browse/SPARK-28292 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0, 2.3.0 Reporter: angerszhu We can't inject hints into the Analyzer; we hope to add an extension entry point to inject user-defined hints
[jira] [Updated] (SPARK-24497) ANSI SQL: Recursive query
[ https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24497: Summary: ANSI SQL: Recursive query (was: Recursive query) > ANSI SQL: Recursive query > - > > Key: SPARK-24497 > URL: https://issues.apache.org/jira/browse/SPARK-24497 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > h3. *Examples* > Here is an example for {{WITH RECURSIVE}} clause usage. Table "department" > represents the structure of an organization as an adjacency list. > {code:sql} > CREATE TABLE department ( > id INTEGER PRIMARY KEY, -- department ID > parent_department INTEGER REFERENCES department, -- upper department ID > name TEXT -- department name > ); > INSERT INTO department (id, parent_department, "name") > VALUES > (0, NULL, 'ROOT'), > (1, 0, 'A'), > (2, 1, 'B'), > (3, 2, 'C'), > (4, 2, 'D'), > (5, 0, 'E'), > (6, 4, 'F'), > (7, 5, 'G'); > -- department structure represented here is as follows: > -- > -- ROOT-+->A-+->B-+->C > -- | | > -- | +->D-+->F > -- +->E-+->G > {code} > > To extract all departments under A, you can use the following recursive > query: > {code:sql} > WITH RECURSIVE subdepartment AS > ( > -- non-recursive term > SELECT * FROM department WHERE name = 'A' > UNION ALL > -- recursive term > SELECT d.* > FROM > department AS d > JOIN > subdepartment AS sd > ON (d.parent_department = sd.id) > ) > SELECT * > FROM subdepartment > ORDER BY name; > {code} > More details: > [http://wiki.postgresql.org/wiki/CTEReadme] > [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html]
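The evaluation model of the `WITH RECURSIVE` query in the example above can be simulated in plain Python: seed with the non-recursive term, then repeatedly join the previous batch against the table (UNION ALL) until no new rows appear. This is only an illustrative sketch of the semantics on the example's adjacency list, not Spark code.

```python
def recursive_cte(department, seed_name):
    """Simulate the example's recursive CTE on an adjacency list.

    department: list of (id, parent_department, name) rows, as in the
    example table. Mirrors UNION ALL semantics: seed with the
    non-recursive term, then repeatedly join the last batch against the
    table until it produces no new rows.
    """
    result = [row for row in department if row[2] == seed_name]  # non-recursive term
    frontier = result
    while frontier:
        ids = {row[0] for row in frontier}  # ids produced in the last step
        # recursive term: rows whose parent is in the last batch
        frontier = [row for row in department if row[1] in ids]
        result.extend(frontier)
    return sorted(result, key=lambda row: row[2])  # ORDER BY name
```

Run against the example data, the subtree under 'A' comes back as A, B, C, D, F, matching the ASCII diagram in the description.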
[jira] [Commented] (SPARK-24497) ANSI SQL: Recursive query
[ https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880051#comment-16880051 ] Yuming Wang commented on SPARK-24497: - Feature ID: T131
[jira] [Updated] (SPARK-24497) Support recursive SQL query
[ https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24497: Issue Type: Sub-task (was: New Feature) Parent: SPARK-27764
[jira] [Updated] (SPARK-24497) Recursive query
[ https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-24497: Summary: Recursive query (was: Support recursive SQL query)
[jira] [Commented] (SPARK-27951) ANSI SQL: NTH_VALUE function
[ https://issues.apache.org/jira/browse/SPARK-27951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880032#comment-16880032 ] Yuming Wang commented on SPARK-27951: - Feature ID: T618 > ANSI SQL: NTH_VALUE function > > > Key: SPARK-27951 > URL: https://issues.apache.org/jira/browse/SPARK-27951 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhu, Lipeng >Priority: Major > > {{nth_value(value any, nth integer)}} returns the same type as {{value}}: {{value}} evaluated at the row that is the {{nth}} row of the window frame (counting from 1); null if no such row. > [https://www.postgresql.org/docs/8.4/functions-window.html]
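The NTH_VALUE behavior described above can be restated as a tiny executable sketch: pick the nth row of the window frame, counting from 1, and yield null (None here) when the frame is shorter. Plain Python for illustration only; this is not Spark's implementation of the window function.

```python
def nth_value(frame, nth):
    """NTH_VALUE semantics: the value at the nth row of a window frame
    (1-based), or None if the frame has no such row."""
    # frame: the ordered rows of the current window frame
    if 1 <= nth <= len(frame):
        return frame[nth - 1]
    return None
```

In SQL the frame is determined per row by the window specification; here it is passed in explicitly to keep the sketch self-contained.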
[jira] [Updated] (SPARK-27951) ANSI SQL: NTH_VALUE function
[ https://issues.apache.org/jira/browse/SPARK-27951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27951: Summary: ANSI SQL: NTH_VALUE function (was: Built-in function: NTH_VALUE)
[jira] [Updated] (SPARK-28291) UDFs cannot be evaluated within inline table definition
[ https://issues.apache.org/jira/browse/SPARK-28291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28291: - Description: {code} spark.udf.register("udf", (input: Double) => input) sql("SELECT * FROM (VALUES (CAST(udf('1') AS DOUBLE)), (CAST(udf('Infinity') AS DOUBLE))) v(x)") {code} {code} org.apache.spark.sql.AnalysisException: cannot evaluate expression CAST(UDF:udf(1) AS DOUBLE) in inline table definition; line 1 pos 23 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2(ResolveInlineTables.scala:68) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2$adapted(ResolveInlineTables.scala:65) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1(ResolveInlineTables.scala:65) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1$adapted(ResolveInlineTables.scala:64) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) {code} was: {code} spark.udf.register("udf", (input: Double) => input) sql("SELECT * FROM (VALUES (CAST(udf('1') AS DOUBLE)), (CAST(udf('Infinity') AS DOUBLE))) v(x)") {code} {code} org.apache.spark.sql.AnalysisException: cannot evaluate expression CAST(UDF:udf(1) AS DOUBLE) in inline table definition; line 1 pos 72 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2(ResolveInlineTables.scala:68) at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2$adapted(ResolveInlineTables.scala:65) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1(ResolveInlineTables.scala:65) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1$adapted(ResolveInlineTables.scala:64) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.validateInputEvaluable(ResolveInlineTables.scala:64) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:35) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:32) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) {code}
[jira] [Updated] (SPARK-28291) UDFs cannot be evaluated within inline table definition
[ https://issues.apache.org/jira/browse/SPARK-28291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28291: - Description: {code} spark.udf.register("udf", (input: Double) => input) sql("SELECT * FROM (VALUES (CAST(udf('1') AS DOUBLE)), (CAST(udf('Infinity') AS DOUBLE))) v(x)") {code} {code} org.apache.spark.sql.AnalysisException: cannot evaluate expression CAST(UDF:udf(1) AS DOUBLE) in inline table definition; line 1 pos 72 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2(ResolveInlineTables.scala:68) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2$adapted(ResolveInlineTables.scala:65) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1(ResolveInlineTables.scala:65) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1$adapted(ResolveInlineTables.scala:64) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.validateInputEvaluable(ResolveInlineTables.scala:64) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:35) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:32) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) {code} was: {code} spark.udf.register("udf", (input: Double) => input) sql("SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) FROM (VALUES (CAST(udf('1') AS DOUBLE)), (CAST(udf('Infinity') AS DOUBLE))) v(x)") {code} {code} org.apache.spark.sql.AnalysisException: cannot evaluate expression CAST(UDF:udf(1) AS DOUBLE) in inline table definition; line 1 pos 72 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2(ResolveInlineTables.scala:68) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2$adapted(ResolveInlineTables.scala:65) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1(ResolveInlineTables.scala:65) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1$adapted(ResolveInlineTables.scala:64) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.validateInputEvaluable(ResolveInlineTables.scala:64) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:35) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:32) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108) at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) {code}
[jira] [Updated] (SPARK-28291) UDFs cannot be evaluated within inline table definition
[ https://issues.apache.org/jira/browse/SPARK-28291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28291: - Description: {code} spark.udf.register("udf", (input: Double) => input) sql("SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) FROM (VALUES (CAST(udf('1') AS DOUBLE)), (CAST(udf('Infinity') AS DOUBLE))) v(x)") {code} {code} org.apache.spark.sql.AnalysisException: cannot evaluate expression CAST(UDF:udf(1) AS DOUBLE) in inline table definition; line 1 pos 72 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2(ResolveInlineTables.scala:68) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2$adapted(ResolveInlineTables.scala:65) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1(ResolveInlineTables.scala:65) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1$adapted(ResolveInlineTables.scala:64) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.validateInputEvaluable(ResolveInlineTables.scala:64) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:35) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:32) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) {code} was: {code} spark.udf.register("udf", (input: Double) => input sql("SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) FROM (VALUES (CAST(udf('1') AS DOUBLE)), (CAST(udf('Infinity') AS DOUBLE))) v(x)") {code} {code} org.apache.spark.sql.AnalysisException: cannot evaluate expression CAST(UDF:udf(1) AS DOUBLE) in inline table definition; line 1 pos 72 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2(ResolveInlineTables.scala:68) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2$adapted(ResolveInlineTables.scala:65) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1(ResolveInlineTables.scala:65) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1$adapted(ResolveInlineTables.scala:64) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.validateInputEvaluable(ResolveInlineTables.scala:64) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:35) at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:32) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108) at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) {code}
[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base
[ https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27921: - Description: This JIRA targets to improve Python test coverage, in particular about {{ExtractPythonUDFs}}. This rule has caused many regressions or issues such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. We should convert *.sql test cases that can be affected by this rule {{ExtractPythonUDFs}} like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] Namely, most plan-related test cases might have to be converted. *Here is the rough contribution guide to follow:* Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if you're able to do this: {code:java} >>> import pandas >>> pandas.__version__ '0.23.4' >>> import pyarrow >>> pyarrow.__version__ '0.13.0' >>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]})) pyarrow.Table a: int64 metadata OrderedDict([(b'pandas', b'{"index_columns": [{"kind": "range", "name": null, "start": ' b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,' b' "field_name": null, "pandas_type": "unicode", "numpy_type":' b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": [' b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu' b'mpy_type": "int64", "metadata": null}], "creator": {"library' b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')]) {code} 1. Copy and paste {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}} file into {{sql/core/src/test/resources/sql-tests/inputs/udf/udf-xxx.sql}} 2. Keep the comments and state that this file was copied from {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}}, for now. 3. Run it below: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git add . 
{code} 4. Insert {{udf(...)}} into each statement. It is not required to add more combinations. And it is not strict about where to insert. 5. Run it below again: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git diff # or git diff --no-index sql/core/src/test/resources/sql-tests/results/xxx.sql.out sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out {code} 6. Compare results with original file, {{sql/core/src/test/resources/sql-tests/results/xxx.sql.out}} 7. If there are diff, analyze it, file or find the JIRA, skip the tests with comments. 8. Run without generating golden files and check: {code:java} build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" {code} 9. When you open a PR. please attach {{git diff --no-index sql/core/src/test/resources/sql-tests/results/xxx.sql.out sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out}} in the PR description with the template below: {code:java} Diff comparing to 'xxx.sql' ```diff ... # here you put 'git diff' results ``` {code} 10. You're ready. Please go for a PR! See https://github.com/apache/spark/pull/25069 as an example. Note that registered UDFs all return strings - so there are some differences are expected. Note that this JIRA targets plan specific cases in general. Note that one {{output.sql.out}} file is shared for three UDF test cases (Scala UDF, Python UDF, and Pandas UDF). Beware of it when you fix the tests. was: This JIRA targets to improve Python test coverage in particular about {{ExtractPythonUDFs}}. This rule has caused many regressions or issues such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. 
We should convert *.sql test cases that can be affected by this rule {{ExtractPythonUDFs}} like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] Namely most of plan related test cases might have to be converted. *Here is the rough contribution guide to follow:* Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if you're able to do this: {code:java} >>> import pandas pan>>> >>> import pandas >>> pandas.__version__ '0.23.4' >>> import pyarrow >>> pyarrow.__version__ '0.13.0' >>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]})) pyarrow.Table a: int64 metadata OrderedDict([(b'pandas', b'{"index_columns": [{"kind": "range", "name": null, "start": ' b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,' b' "field_name": null, "pandas_type": "unicode", "numpy_type":' b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": [' b'{"name": "a", "field_n
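The note that the registered UDFs all return strings is the main source of the expected golden-file diffs. Here is a minimal plain-Python sketch of that behavior (an illustration only, not Spark's actual UDF registration API):

```python
# Plain-Python stand-in for the string-returning identity UDFs the test
# harness registers (assumption: the name 'udf' mirrors the wrapper used
# in the converted udf-xxx.sql files; this is not Spark code).
def udf(value):
    """Identity UDF that, like the test harness UDFs, returns a string."""
    return str(value)

print(repr(udf(42)))       # '42' -- the numeric value comes back as a string
print(int(udf(42)) == 42)  # True -- an explicit cast restores the value
```

This is why the converted queries often wrap the UDF call in a CAST, and why a result that was `42` in `xxx.sql.out` can appear as `42`'s string form in the UDF variant.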
[jira] [Commented] (SPARK-28273) Convert and port 'pgSQL/case.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880024#comment-16880024 ] Apache Spark commented on SPARK-28273:

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/25070

> Convert and port 'pgSQL/case.sql' into UDF test base
>
> Key: SPARK-28273
> URL: https://issues.apache.org/jira/browse/SPARK-28273
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> See SPARK-27934

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28273) Convert and port 'pgSQL/case.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28273:

Assignee: Apache Spark

> Convert and port 'pgSQL/case.sql' into UDF test base
>
> Key: SPARK-28273
> URL: https://issues.apache.org/jira/browse/SPARK-28273
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Hyukjin Kwon
> Assignee: Apache Spark
> Priority: Major
>
> See SPARK-27934
[jira] [Assigned] (SPARK-28273) Convert and port 'pgSQL/case.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28273:

Assignee: (was: Apache Spark)

> Convert and port 'pgSQL/case.sql' into UDF test base
>
> Key: SPARK-28273
> URL: https://issues.apache.org/jira/browse/SPARK-28273
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> See SPARK-27934
[jira] [Assigned] (SPARK-28270) Convert and port 'pgSQL/aggregates_part1.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28270:

Assignee: Apache Spark

> Convert and port 'pgSQL/aggregates_part1.sql' into UDF test base
>
> Key: SPARK-28270
> URL: https://issues.apache.org/jira/browse/SPARK-28270
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Hyukjin Kwon
> Assignee: Apache Spark
> Priority: Major
>
> see SPARK-27770
[jira] [Assigned] (SPARK-28270) Convert and port 'pgSQL/aggregates_part1.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28270:

Assignee: (was: Apache Spark)

> Convert and port 'pgSQL/aggregates_part1.sql' into UDF test base
>
> Key: SPARK-28270
> URL: https://issues.apache.org/jira/browse/SPARK-28270
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> see SPARK-27770
[jira] [Created] (SPARK-28291) UDFs cannot be evaluated within inline table definition
Hyukjin Kwon created SPARK-28291:

Summary: UDFs cannot be evaluated within inline table definition
Key: SPARK-28291
URL: https://issues.apache.org/jira/browse/SPARK-28291
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon

{code}
spark.udf.register("udf", (input: Double) => input)
sql("SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) FROM (VALUES (CAST(udf('1') AS DOUBLE)), (CAST(udf('Infinity') AS DOUBLE))) v(x)")
{code}

{code}
org.apache.spark.sql.AnalysisException: cannot evaluate expression CAST(UDF:udf(1) AS DOUBLE) in inline table definition; line 1 pos 72
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2(ResolveInlineTables.scala:68)
  at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$2$adapted(ResolveInlineTables.scala:65)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1(ResolveInlineTables.scala:65)
  at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$validateInputEvaluable$1$adapted(ResolveInlineTables.scala:64)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.validateInputEvaluable(ResolveInlineTables.scala:64)
  at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:35)
  at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:32)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
{code}
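The failure comes from {{ResolveInlineTables}} rejecting any non-evaluable expression inside a VALUES clause. A hedged, untested workaround sketch (not from the report): keep the inline table literal-only and apply the UDF in the enclosing SELECT, so the analyzer never has to evaluate it inside the table definition:

```sql
-- Fails during analysis: udf(...) sits inside the inline table definition.
SELECT avg(CAST(x AS DOUBLE))
FROM (VALUES (CAST(udf('1') AS DOUBLE)), (CAST(udf('Infinity') AS DOUBLE))) v(x);

-- Sketch of a workaround: VALUES holds only literals, so ResolveInlineTables
-- has nothing non-evaluable to check; the UDF wraps the column afterwards.
SELECT avg(CAST(udf(x) AS DOUBLE))
FROM (VALUES ('1'), ('Infinity')) v(x);
```

This rewrite changes where the UDF runs (once per row of the resolved table rather than at definition time), which is acceptable for the test-coverage purpose but should be verified against the golden files.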
[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base
[ https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-27921:
---------------------------------
Description:

This JIRA targets improving Python test coverage, in particular around {{ExtractPythonUDFs}}. This rule has caused many regressions and issues, such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. We should convert the *.sql test cases that can be affected by the {{ExtractPythonUDFs}} rule, like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql]. Namely, most plan-related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check that you're able to do this:

{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata
--------
OrderedDict([(b'pandas',
              b'{"index_columns": [{"kind": "range", "name": null, "start": '
              b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
              b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
              b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
              b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
              b'mpy_type": "int64", "metadata": null}], "creator": {"library'
              b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])
{code}

1. Copy {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}} into {{sql/core/src/test/resources/sql-tests/inputs/udf/udf-xxx.sql}}.
2. Keep the comments, and state that this file was copied from {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}}, for now.
3. Generate the golden files:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement. It is not required to add more combinations, and it is not strict about where to insert it.
5. Regenerate the golden files:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
git diff
# or
diff sql/core/src/test/resources/sql-tests/results/xxx.sql.out sql/core/src/test/resources/sql-tests/results/udf/udf-xxx.sql.out
{code}
6. Compare the results with the original file, {{sql/core/src/test/resources/sql-tests/results/xxx.sql.out}}.
7. If there is a diff, file a JIRA (or find an existing one) and skip the affected tests with comments.
8. Run without generating golden files and check that the tests pass:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}
9. When you open a PR, please attach the output of {{diff sql/core/src/test/resources/sql-tests/results/xxx.sql.out sql/core/src/test/resources/sql-tests/results/udf/udf-xxx.sql.out}} in the PR description, using the template below:
{code:java}
Diff comparing to 'xxx.sql'
```diff
...  # here you put the 'git diff' results
```
{code}
10. You're ready. Please go for a PR!

Note that the registered UDFs all return strings, so some differences are expected. Note that this JIRA targets plan-specific cases in general.
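The version requirement above can also be checked programmatically. Below is a minimal sketch; the {{meets_minimum}} helper is purely illustrative (it is not part of Spark or of this guide), and it assumes plain dotted numeric version strings like those printed in the session above:

```python
# Compare dotted numeric version strings (e.g. "0.23.4") numerically,
# not lexically, against the minimums the guide requires:
# Pandas 0.23.2+ and PyArrow 0.12.1+.

def meets_minimum(installed: str, required: str) -> bool:
    """Return True if `installed` >= `required` for dotted numeric versions."""
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) >= as_tuple(required)

print(meets_minimum("0.23.4", "0.23.2"))  # pandas from the session above: True
print(meets_minimum("0.13.0", "0.12.1"))  # pyarrow from the session above: True
```

In practice one would pass {{pandas.__version__}} and {{pyarrow.__version__}} as the first argument; real releases may carry suffixes (e.g. release candidates) that this simple sketch does not handle.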
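The comparison in steps 5–6 can also be done from Python with the standard library, which is convenient when scanning many golden files. A hedged sketch, with tiny inlined file contents for illustration (real contents would be read from {{sql/core/src/test/resources/sql-tests/results/}}):

```python
import difflib

# Illustrative golden-file contents: the original output and the
# UDF-converted output. Only the query line differs.
original = "-- !query\nSELECT a FROM t\n-- !query output\n1\n"
converted = "-- !query\nSELECT udf(a) FROM t\n-- !query output\n1\n"

# Produce a unified diff, labeled like the `diff` invocation in step 5.
diff = list(difflib.unified_diff(
    original.splitlines(), converted.splitlines(),
    fromfile="xxx.sql.out", tofile="udf/udf-xxx.sql.out", lineterm=""))
print("\n".join(diff))
```

The printed diff is exactly what step 9 asks to paste into the PR description's {{```diff}} block.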
[jira] [Updated] (SPARK-28231) Adaptive execution should ignore RepartitionByExpression
[ https://issues.apache.org/jira/browse/SPARK-28231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jrex Ge updated SPARK-28231:
Issue Type: Wish (was: Bug)

> Adaptive execution should ignore RepartitionByExpression
> --------------------------------------------------------
>
>                 Key: SPARK-28231
>                 URL: https://issues.apache.org/jira/browse/SPARK-28231
>             Project: Spark
>          Issue Type: Wish
>          Components: SQL
>    Affects Versions: 2.4.1
>            Reporter: Jrex Ge
>            Priority: Major
>
> When adaptive execution is enabled, it modifies the partitioning specified by
> {{Dataset.repartition}} by expressions (RepartitionByExpression); it should leave
> that partitioning untouched.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base
> Convert applicable *.sql tests into UDF integrated test base
> ------------------------------------------------------------
>
>                 Key: SPARK-27921
>                 URL: https://issues.apache.org/jira/browse/SPARK-27921
>             Project: Spark
>          Issue Type: Umbrella
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>
> This JIRA targets improving Python test coverage, in particular around
> {{ExtractPythonUDFs}}. This rule has caused many regressions and issues, such as
> SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
> We should convert the *.sql test cases that can be affected by the
> {{ExtractPythonUDFs}} rule, like
> [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql]
> Namely, most plan-related test cases might have to be converted.
[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base
[ https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27921: - Description: This JIRA targets to improve Python test coverage in particular about {{ExtractPythonUDFs}}. This rule has caused many regressions or issues such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. We should convert *.sql test cases that can be affected by this rule {{ExtractPythonUDFs}} like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] Namely most of plan related test cases might have to be converted. *Here is the rough contribution guide to follow:* 1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}} 2. Keep the comments and state that this file was copied from {{xxx.sql}}, for now. 3. Run it below: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git add . {code} 4. Insert `udf(...)` into each statement. It is not required to add more combinations. And it is not strict about where to insert. 5. Run it below again: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git diff {code} 6. Compare results with original file, {{xxx.sq}}`. If there are no notable diff, open a PR. 7. If there are diff, file or find the JIRA, skip the tests with comments. 8. Run without generating golden files and check: {code:java} build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" {code} 9. _If possible_ - not required, when you open a PR. please attach {{git diff xxx.sql.out}} between 3. and 5. in the PR description with the template below: {code} Diff comparing to 'xxx.sql' ```diff ... # here you put 'git diff' results ``` {code} Note that registered UDFs all return strings - so there are some differences are expected. 
Note that this JIRA targets plan specific cases in general. was: This JIRA targets to improve Python test coverage in particular about {{ExtractPythonUDFs}}. This rule has caused many regressions or issues such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. We should convert *.sql test cases that can be affected by this rule {{ExtractPythonUDFs}} like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] Namely most of plan related test cases might have to be converted. *Here is the rough contribution guide to follow:* 1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}} 2. Keep the comments and state that this file was copied from {{xxx.sql}}, for now. 3. Run it below: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git add . {code} 4. Insert `udf(...)` into each statement. It is not required to add more combinations. And it is not strict about where to insert. 5. Run it below again: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git diff {code} 6. Compare results with original file, {{xxx.sq}}`. If there are no notable diff, open a PR. 7. If there are diff, file or find the JIRA, skip the tests with comments. 8. Run without generating golden files and check: {code:java} build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" {code} 9. When you open a PR. please attach {{git diff xxx.sql.out}} between 3. and 5. in the PR description with the template below: {code} Diff comparing to 'xxx.sql' ```diff ... # here you put 'git diff' results ``` {code} Note that registered UDFs all return strings - so there are some differences are expected. Note that this JIRA targets plan specific cases in general. 
> Convert applicable *.sql tests into UDF integrated test base > > > Key: SPARK-27921 > URL: https://issues.apache.org/jira/browse/SPARK-27921 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > This JIRA targets to improve Python test coverage in particular about > {{ExtractPythonUDFs}}. > This rule has caused many regressions or issues such as SPARK-27803, > SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. > We should convert *.sql test cases that can be affected by this rule > {{ExtractPythonUDFs}} like > [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] > Namely most of plan related test cases might have to be converted. > *Here is the rough contributio
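For readers following the guide above, step 4 amounts to mechanically wrapping the expressions of each statement in {{udf(...)}} before regenerating the golden files. The sketch below is only an illustration of that transformation, not part of Spark's actual test harness; the class name {{UdfWrap}} and its deliberately trivial parsing are made up for this example and only handle a simple projection.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class UdfWrap {
    // Rewrites a trivial projection such as "SELECT a, b FROM t"
    // into "SELECT udf(a), udf(b) FROM t"; anything else is returned unchanged.
    static String wrapProjection(String sql) {
        String upper = sql.toUpperCase();
        int select = upper.indexOf("SELECT ");
        int from = upper.indexOf(" FROM ");
        if (select != 0 || from < 0) {
            return sql; // not a simple projection; leave it alone
        }
        String cols = sql.substring("SELECT ".length(), from);
        String wrapped = Arrays.stream(cols.split(","))
                .map(String::trim)
                .map(c -> "udf(" + c + ")")
                .collect(Collectors.joining(", "));
        return "SELECT " + wrapped + sql.substring(from);
    }

    public static void main(String[] args) {
        System.out.println(wrapProjection("SELECT a, b FROM t"));
    }
}
```

In the real conversions the wrapping is done by hand in the .sql files, and, as the description notes, the placement is not strict.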
[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base
[ https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27921: - Description: This JIRA aims to improve Python test coverage, particularly around {{ExtractPythonUDFs}}. This rule has caused many regressions or issues such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. We should convert the *.sql test cases that can be affected by this rule ({{ExtractPythonUDFs}}), like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] Namely, most plan-related test cases might have to be converted. *Here is the rough contribution guide to follow:* 1. Copy the {{xxx.sql}} file to {{udf/udf-xxx.sql}} 2. Keep the comments and state that this file was copied from {{xxx.sql}}, for now. 3. Run the following: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git add . {code} 4. Insert {{udf(...)}} into each statement. It is not required to add more combinations, and the exact insertion points are not strict. 5. Run the following again: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git diff {code} 6. Compare the results with the original file, {{xxx.sql}}. If there is no notable diff, open a PR. 7. If there is a diff, file or find the corresponding JIRA and skip the tests with comments. 8. Run without generating golden files and check: {code:java} build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" {code} 9. When you open a PR, please attach the {{git diff xxx.sql.out}} between steps 3 and 5 in the PR description with the template below: {code} Diff comparing to 'xxx.sql' ```diff ... # here you put 'git diff' results ``` {code} Note that the registered UDFs all return strings, so some differences are expected. Note that this JIRA targets plan-specific cases in general. 
was: This JIRA targets to improve Python test coverage in particular about {{ExtractPythonUDFs}}. This rule has caused many regressions or issues such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. We should convert *.sql test cases that can be affected by this rule {{ExtractPythonUDFs}} like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] Namely most of plan related test cases might have to be converted. *Here is the rough contribution guide to follow:* 1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}} 2. Keep the comments and state that this file was copied from {{xxx.sql}}, for now. 3. Run it below: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git add . {code} 4. Insert `udf(...)` into each statement. It is not required to add more combinations. And it is not strict about where to insert. 5. Run it below again: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git diff {code} 6. Compare results with original file, {{xxx.sq}}`. If there are no notable diff, open a PR. 7. If there are diff, file or find the JIRA, skip the tests with comments. 8. Run without generating golden files and check: {code:java} build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" {code} 9. When you open a PR. please attach {{git diff xxx.sql.out}} between 3. and 5. in the PR description. Note that registered UDFs all return strings - so there are some differences are expected. Note that this JIRA targets plan specific cases in general. 
> Convert applicable *.sql tests into UDF integrated test base > > > Key: SPARK-27921 > URL: https://issues.apache.org/jira/browse/SPARK-27921 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > This JIRA targets to improve Python test coverage in particular about > {{ExtractPythonUDFs}}. > This rule has caused many regressions or issues such as SPARK-27803, > SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. > We should convert *.sql test cases that can be affected by this rule > {{ExtractPythonUDFs}} like > [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] > Namely most of plan related test cases might have to be converted. > *Here is the rough contribution guide to follow:* > 1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}} > 2. Keep the comments and state that this file was copied from {{xxx.sq
[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base
[ https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27921: - Description: This JIRA aims to improve Python test coverage, particularly around {{ExtractPythonUDFs}}. This rule has caused many regressions or issues such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. We should convert the *.sql test cases that can be affected by this rule ({{ExtractPythonUDFs}}), like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] Namely, most plan-related test cases might have to be converted. *Here is the rough contribution guide to follow:* 1. Copy the {{xxx.sql}} file to {{udf/udf-xxx.sql}} 2. Keep the comments and state that this file was copied from {{xxx.sql}}, for now. 3. Run the following: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git add . {code} 4. Insert {{udf(...)}} into each statement. It is not required to add more combinations, and the exact insertion points are not strict. 5. Run the following again: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git diff {code} 6. Compare the results with the original file, {{xxx.sql}}. If there is no notable diff, open a PR. 7. If there is a diff, file or find the corresponding JIRA and skip the tests with comments. 8. Run without generating golden files and check: {code:java} build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" {code} 9. When you open a PR, please attach the {{git diff xxx.sql.out}} between steps 3 and 5 in the PR description. Note that the registered UDFs all return strings, so some differences are expected. Note that this JIRA targets plan-specific cases in general. was: This JIRA targets to improve Python test coverage in particular about {{ExtractPythonUDFs}}. 
This rule has caused many regressions or issues such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. We should convert *.sql test cases that can be affected by this rule {{ExtractPythonUDFs}} like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] Namely most of plan related test cases might have to be converted. *Here is the rough contribution guide to follow:* 1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}} 2. Keep the comments and state that this file was copied from {{xxx.sql}}, for now. 3. Run it below: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git add . {code} 4. Insert `udf(...)` into each statement. It is not required to add more combinations. And it is not strict about where to insert. 5. Run it below again: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git diff {code} 6. Compare results with original file, {{xxx.sq}}`. If there are no notable diff, open a PR. 7. If there are diff, file or find the JIRA, skip the tests with comments. 8. Run without generating golden files and check: {code:java} build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" {code} 9. When you open a PR. please attach {{git diff xxx.sql.out}} in the PR description. Note that registered UDFs all return strings - so there are some differences are expected. Note that this JIRA targets plan specific cases in general. > Convert applicable *.sql tests into UDF integrated test base > > > Key: SPARK-27921 > URL: https://issues.apache.org/jira/browse/SPARK-27921 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > This JIRA targets to improve Python test coverage in particular about > {{ExtractPythonUDFs}}. 
> This rule has caused many regressions or issues such as SPARK-27803, > SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. > We should convert *.sql test cases that can be affected by this rule > {{ExtractPythonUDFs}} like > [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] > Namely most of plan related test cases might have to be converted. > *Here is the rough contribution guide to follow:* > 1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}} > 2. Keep the comments and state that this file was copied from {{xxx.sql}}, > for now. > 3. Run it below: > {code:java} > SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- > -z udf/u
[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base
[ https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27921: - Description: This JIRA aims to improve Python test coverage, particularly around {{ExtractPythonUDFs}}. This rule has caused many regressions or issues such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. We should convert the *.sql test cases that can be affected by this rule ({{ExtractPythonUDFs}}), like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] Namely, most plan-related test cases might have to be converted. *Here is the rough contribution guide to follow:* 1. Copy the {{xxx.sql}} file to {{udf/udf-xxx.sql}} 2. Keep the comments and state that this file was copied from {{xxx.sql}}, for now. 3. Run the following: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git add . {code} 4. Insert {{udf(...)}} into each statement. It is not required to add more combinations, and the exact insertion points are not strict. 5. Run the following again: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git diff {code} 6. Compare the results with the original file, {{xxx.sql}}. If there is no notable diff, open a PR. 7. If there is a diff, file or find the corresponding JIRA and skip the tests with comments. 8. Run without generating golden files and check: {code:java} build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" {code} 9. When you open a PR, please attach the {{git diff xxx.sql.out}} in the PR description. Note that the registered UDFs all return strings, so some differences are expected. Note that this JIRA targets plan-specific cases in general. was: This JIRA targets to improve Python test coverage in particular about {{ExtractPythonUDFs}}. 
This rule has caused many regressions or issues such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. We should convert *.sql test cases that can be affected by this rule {{ExtractPythonUDFs}} like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] Namely most of plan related test cases might have to be converted. *Here is the rough contribution guide to follow:* 1. Copy and paste {{xxx.sql}} file into {{udf/udf-xxx.sql}} 2. Keep the comments and state that this file was copied from {{xxx.sql}}, for now. 3. Run it below: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git add . {code} 4. Insert {{udf(...)}} into each statement. It is not required to add more combinations. And it is not strict about where to insert. 5. Run it below again: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git diff {code} 6. Compare results with original file, {{xxx.sql}}. If there are no notable diff, open a PR. 7. If there are diff, file or find the JIRA, skip the tests with comments. > Convert applicable *.sql tests into UDF integrated test base > > > Key: SPARK-27921 > URL: https://issues.apache.org/jira/browse/SPARK-27921 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > This JIRA targets to improve Python test coverage in particular about > {{ExtractPythonUDFs}}. > This rule has caused many regressions or issues such as SPARK-27803, > SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. 
> We should convert *.sql test cases that can be affected by this rule > {{ExtractPythonUDFs}} like > [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] > Namely most of plan related test cases might have to be converted. > *Here is the rough contribution guide to follow:* > 1. Copy and paste 'xxx.sql' file into {{udf/udf-xxx.sql}} > 2. Keep the comments and state that this file was copied from {{xxx.sql}}, > for now. > 3. Run it below: > {code:java} > SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- > -z udf/udf-xxx.sql" > git add . > {code} > 4. Insert `udf(...)` into each statement. It is not required to add more > combinations. > And it is not strict about where to insert. > 5. Run it below again: > {code:java} > SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- > -z udf/udf-xxx.sql" > git diff > {code} > 6. Compare results with original file, {{xxx.sq}}`. If the
[jira] [Commented] (SPARK-28264) Revisiting Python / pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879980#comment-16879980 ] Sean Owen commented on SPARK-28264: --- I generally like the rationalization of the various UDF types, as they do different things, things which aren't so obvious from the names. Anything we can do to clarify is a win. > Revisiting Python / pandas UDF > -- > > Key: SPARK-28264 > URL: https://issues.apache.org/jira/browse/SPARK-28264 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > > In the past two years, the pandas UDFs are perhaps the most important changes > to Spark for Python data science. However, these functionalities have evolved > organically, leading to some inconsistencies and confusions among users. This > document revisits UDF definition and naming, as a result of discussions among > Xiangrui, Li Jin, Hyukjin, and Reynold. > > See document here: > [https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit#|https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28290) Use `SslContextFactory.Server` instead of `SslContextFactory`
[ https://issues.apache.org/jira/browse/SPARK-28290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28290: Assignee: Apache Spark > Use `SslContextFactory.Server` instead of `SslContextFactory` > - > > Key: SPARK-28290 > URL: https://issues.apache.org/jira/browse/SPARK-28290 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > `SslContextFactory` is deprecated at Jetty 9.4. This issue replaces it with > `SslContextFactory.Server`. > - > https://www.eclipse.org/jetty/javadoc/9.4.19.v20190610/org/eclipse/jetty/util/ssl/SslContextFactory.html > - > https://www.eclipse.org/jetty/javadoc/9.3.24.v20180605/org/eclipse/jetty/util/ssl/SslContextFactory.html
[jira] [Assigned] (SPARK-28290) Use `SslContextFactory.Server` instead of `SslContextFactory`
[ https://issues.apache.org/jira/browse/SPARK-28290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28290: Assignee: (was: Apache Spark) > Use `SslContextFactory.Server` instead of `SslContextFactory` > - > > Key: SPARK-28290 > URL: https://issues.apache.org/jira/browse/SPARK-28290 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > `SslContextFactory` is deprecated at Jetty 9.4. This issue replaces it with > `SslContextFactory.Server`. > - > https://www.eclipse.org/jetty/javadoc/9.4.19.v20190610/org/eclipse/jetty/util/ssl/SslContextFactory.html > - > https://www.eclipse.org/jetty/javadoc/9.3.24.v20180605/org/eclipse/jetty/util/ssl/SslContextFactory.html
[jira] [Created] (SPARK-28290) Use `SslContextFactory.Server` instead of `SslContextFactory`
Dongjoon Hyun created SPARK-28290: - Summary: Use `SslContextFactory.Server` instead of `SslContextFactory` Key: SPARK-28290 URL: https://issues.apache.org/jira/browse/SPARK-28290 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 3.0.0 Reporter: Dongjoon Hyun `SslContextFactory` is deprecated at Jetty 9.4. This issue replaces it with `SslContextFactory.Server`. - https://www.eclipse.org/jetty/javadoc/9.4.19.v20190610/org/eclipse/jetty/util/ssl/SslContextFactory.html - https://www.eclipse.org/jetty/javadoc/9.3.24.v20180605/org/eclipse/jetty/util/ssl/SslContextFactory.html
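For context on the proposed change: Jetty 9.4 deprecated instantiating the base `SslContextFactory` directly in favor of its server/client-specific subclasses. A minimal sketch of the replacement follows, assuming Jetty 9.4 on the classpath; the class and setter names come from the Jetty javadoc linked above, but the keystore path and password here are placeholder values, and this is not Spark's actual code.

```java
import org.eclipse.jetty.util.ssl.SslContextFactory;

public class SslFactoryExample {
    static SslContextFactory.Server newServerFactory(String keystorePath, String password) {
        // Before (deprecated in Jetty 9.4): new SslContextFactory()
        // After: the server-specific subclass
        SslContextFactory.Server factory = new SslContextFactory.Server();
        factory.setKeyStorePath(keystorePath);
        factory.setKeyStorePassword(password);
        return factory;
    }
}
```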
[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base
[ https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27921: - Description: This JIRA targets to improve Python test coverage in particular about {{ExtractPythonUDFs}}. This rule has caused many regressions or issues such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. We should convert *.sql test cases that can be affected by this rule {{ExtractPythonUDFs}} like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] Namely most of plan related test cases might have to be converted. *Here is the rough contribution guide to follow:* 1. Copy and paste {{xxx.sql}} file into {{udf/udf-xxx.sql}} 2. Keep the comments and state that this file was copied from {{xxx.sql}}, for now. 3. Run it below: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git add . {code} 4. Insert {{udf(...)}} into each statement. It is not required to add more combinations. And it is not strict about where to insert. 5. Run it below again: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git diff {code} 6. Compare results with original file, {{xxx.sql}}. If there are no notable diff, open a PR. 7. If there are diff, file or find the JIRA, skip the tests with comments. was: This JIRA targets to improve Python test coverage in particular about {{ExtractPythonUDFs}}. This rule has caused many regressions or issues such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. 
We should convert *.sql test cases that can be affected by this rule {{ExtractPythonUDFs}} like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] Namely most of plan related test cases might have to be converted. Here is the rough contribution guide to follow: 1. Copy and paste {{xxx.sql}} file into {{udf/udf-xxx.sql}} 2. Keep the comments and state that this file was copied from {{xxx.sql}}, for now. 3. Run it below: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git add . {code} 4. Insert {{udf(...)}} into each statement. It is not required to add more combinations. And it is not strict about where to insert. 5. Run it below again: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git diff {code} 6. Compare results with original file, {{xxx.sql}}. If there are no notable diff, open a PR. 7. If there are diff, file or find the JIRA, skip the tests with comments. > Convert applicable *.sql tests into UDF integrated test base > > > Key: SPARK-27921 > URL: https://issues.apache.org/jira/browse/SPARK-27921 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > This JIRA targets to improve Python test coverage in particular about > {{ExtractPythonUDFs}}. > This rule has caused many regressions or issues such as SPARK-27803, > SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. > We should convert *.sql test cases that can be affected by this rule > {{ExtractPythonUDFs}} like > [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] > Namely most of plan related test cases might have to be converted. > *Here is the rough contribution guide to follow:* > 1. 
Copy and paste {{xxx.sql}} file into {{udf/udf-xxx.sql}} > 2. Keep the comments and state that this file was copied from {{xxx.sql}}, > for now. > 3. Run it below: > {code:java} > SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- > -z udf/udf-xxx.sql" > git add . > {code} > 4. Insert {{udf(...)}} into each statement. It is not required to add more > combinations. > And it is not strict about where to insert. > 5. Run it below again: > {code:java} > SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- > -z udf/udf-xxx.sql" > git diff > {code} > 6. Compare results with original file, {{xxx.sql}}. If there are no notable > diff, open a PR. > 7. If there are diff, file or find the JIRA, skip the tests with comments.
[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base
[ https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27921: - Description: This JIRA targets to improve Python test coverage in particular about {{ExtractPythonUDFs}}. This rule has caused many regressions or issues such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. We should convert *.sql test cases that can be affected by this rule {{ExtractPythonUDFs}} like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] Namely most of plan related test cases might have to be converted. Here is the rough contribution guide to follow: 1. Copy and paste {{xxx.sql}} file into {{udf/udf-xxx.sql}} 2. Keep the comments and state that this file was copied from {{xxx.sql}}, for now. 3. Run it below: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git add . {code} 4. Insert {{udf(...)}} into each statement. It is not required to add more combinations. And it is not strict about where to insert. 5. Run it below again: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git diff {code} 6. Compare results with original file, {{xxx.sql}}. If there are no notable diff, open a PR. 7. If there are diff, file or find the JIRA, skip the tests with comments. was: This JIRA targets to improve Python test coverage in particular about {{ExtractPythonUDFs}}. This rule has caused many regressions or issues such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. 
We should convert *.sql test cases that can be affected by this rule {{ExtractPythonUDFs}} like https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql Namely most of plan related test cases might have to be converted. > Convert applicable *.sql tests into UDF integrated test base > > > Key: SPARK-27921 > URL: https://issues.apache.org/jira/browse/SPARK-27921 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > This JIRA targets to improve Python test coverage in particular about > {{ExtractPythonUDFs}}. > This rule has caused many regressions or issues such as SPARK-27803, > SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. > We should convert *.sql test cases that can be affected by this rule > {{ExtractPythonUDFs}} like > [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] > Namely most of plan related test cases might have to be converted. > Here is the rough contribution guide to follow: > 1. Copy and paste {{xxx.sql}} file into {{udf/udf-xxx.sql}} > 2. Keep the comments and state that this file was copied from {{xxx.sql}}, > for now. > 3. Run it below: > {code:java} > SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- > -z udf/udf-xxx.sql" > git add . > {code} > 4. Insert {{udf(...)}} into each statement. It is not required to add more > combinations. > And it is not strict about where to insert. > 5. Run it below again: > {code:java} > SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- > -z udf/udf-xxx.sql" > git diff > {code} > 6. Compare results with original file, {{xxx.sql}}. If there are no notable > diff, open a PR. > 7. If there are diff, file or find the JIRA, skip the tests with comments. 
[jira] [Updated] (SPARK-27921) Convert applicable *.sql tests into UDF integrated test base
[ https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27921: - Description: This JIRA targets to improve Python test coverage in particular about {{ExtractPythonUDFs}}. This rule has caused many regressions or issues such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. We should convert *.sql test cases that can be affected by this rule {{ExtractPythonUDFs}} like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] Namely most of plan related test cases might have to be converted. Here is the rough contribution guide to follow: 1. Copy and paste {{xxx.sql}} file into {{udf/udf-xxx.sql}} 2. Keep the comments and state that this file was copied from {{xxx.sql}}, for now. 3. Run it below: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git add . {code} 4. Insert {{udf(...)}} into each statement. It is not required to add more combinations. And it is not strict about where to insert. 5. Run it below again: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git diff {code} 6. Compare results with original file, {{xxx.sql}}. If there are no notable diff, open a PR. 7. If there are diff, file or find the JIRA, skip the tests with comments. was: This JIRA targets to improve Python test coverage in particular about {{ExtractPythonUDFs}}. This rule has caused many regressions or issues such as SPARK-27803, SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. 
We should convert *.sql test cases that can be affected by this rule {{ExtractPythonUDFs}} like [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] Namely most of plan related test cases might have to be converted. Here is the rough contribution guide to follow: 1. Copy and paste {{xxx.sql}} file into {{udf/udf-xxx.sql}} 2. Keep the comments and state that this file was copied from {{xxx.sql}}, for now. 3. Run it below: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git add . {code} 4. Insert {{udf(...)}} into each statement. It is not required to add more combinations. And it is not strict about where to insert. 5. Run it below again: {code:java} SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql" git diff {code} 6. Compare results with original file, {{xxx.sql}}. If there are no notable diff, open a PR. 7. If there are diff, file or find the JIRA, skip the tests with comments. > Convert applicable *.sql tests into UDF integrated test base > > > Key: SPARK-27921 > URL: https://issues.apache.org/jira/browse/SPARK-27921 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > This JIRA targets to improve Python test coverage in particular about > {{ExtractPythonUDFs}}. > This rule has caused many regressions or issues such as SPARK-27803, > SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721. > We should convert *.sql test cases that can be affected by this rule > {{ExtractPythonUDFs}} like > [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql] > Namely most of plan related test cases might have to be converted. > Here is the rough contribution guide to follow: > 1. 
Copy and paste {{xxx.sql}} file into {{udf/udf-xxx.sql}} > 2. Keep the comments and state that this file was copied from {{xxx.sql}}, > for now. > 3. Run it below: > {code:java} > SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- > -z udf/udf-xxx.sql" > git add . > {code} > 4. Insert {{udf(...)}} into each statement. It is not required to add more > combinations. > And it is not strict about where to insert. > 5. Run it below again: > {code:java} > SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- > -z udf/udf-xxx.sql" > git diff > {code} > 6. Compare results with original file, {{xxx.sql}}. If there are no notable > diff, open a PR. > 7. If there are diff, file or find the JIRA, skip the tests with comments. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
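Step 4 of the guide above (inserting {{udf(...)}} into each statement) is done by hand on the .sql files, but the transformation can be sketched with a small, purely illustrative helper — the function name and the mechanical regex approach here are assumptions for demonstration, not part of the actual workflow:

```python
import re

def wrap_columns_in_udf(sql, columns):
    # Illustrative only: wrap each bare column reference in udf(...),
    # mimicking step 4. In practice placement need not be exhaustive,
    # and the .sql golden files are edited manually.
    for col in columns:
        # \b keeps us from matching substrings of longer identifiers
        sql = re.sub(rf"\b{re.escape(col)}\b", f"udf({col})", sql)
    return sql

before = "SELECT a, b FROM t WHERE a > 1"
print(wrap_columns_in_udf(before, ["a", "b"]))
# SELECT udf(a), udf(b) FROM t WHERE udf(a) > 1
```

The converted query still computes the same result as the original, which is why step 6 compares the regenerated golden output against the original {{xxx.sql}} results before opening a PR.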
[jira] [Created] (SPARK-28288) Convert and port 'window.sql' into UDF test base
Hyukjin Kwon created SPARK-28288: Summary: Convert and port 'window.sql' into UDF test base Key: SPARK-28288 URL: https://issues.apache.org/jira/browse/SPARK-28288 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon

[jira] [Created] (SPARK-28289) Convert and port 'union.sql' into UDF test base
Hyukjin Kwon created SPARK-28289: Summary: Convert and port 'union.sql' into UDF test base Key: SPARK-28289 URL: https://issues.apache.org/jira/browse/SPARK-28289 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon
[jira] [Created] (SPARK-28285) Convert and port 'outer-join.sql' into UDF test base
Hyukjin Kwon created SPARK-28285: Summary: Convert and port 'outer-join.sql' into UDF test base Key: SPARK-28285 URL: https://issues.apache.org/jira/browse/SPARK-28285 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon
[jira] [Created] (SPARK-28283) Convert and port 'intersect-all.sql' into UDF test base
Hyukjin Kwon created SPARK-28283: Summary: Convert and port 'intersect-all.sql' into UDF test base Key: SPARK-28283 URL: https://issues.apache.org/jira/browse/SPARK-28283 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon
[jira] [Created] (SPARK-28286) Convert and port 'pivot.sql' into UDF test base
Hyukjin Kwon created SPARK-28286: Summary: Convert and port 'pivot.sql' into UDF test base Key: SPARK-28286 URL: https://issues.apache.org/jira/browse/SPARK-28286 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon
[jira] [Created] (SPARK-28287) Convert and port 'udaf.sql' into UDF test base
Hyukjin Kwon created SPARK-28287: Summary: Convert and port 'udaf.sql' into UDF test base Key: SPARK-28287 URL: https://issues.apache.org/jira/browse/SPARK-28287 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon
[jira] [Created] (SPARK-28284) Convert and port 'join-empty-relation.sql' into UDF test base
Hyukjin Kwon created SPARK-28284: Summary: Convert and port 'join-empty-relation.sql' into UDF test base Key: SPARK-28284 URL: https://issues.apache.org/jira/browse/SPARK-28284 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon
[jira] [Created] (SPARK-28280) Convert and port 'group-by.sql' into UDF test base
Hyukjin Kwon created SPARK-28280: Summary: Convert and port 'group-by.sql' into UDF test base Key: SPARK-28280 URL: https://issues.apache.org/jira/browse/SPARK-28280 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon
[jira] [Created] (SPARK-28282) Convert and port 'inline-table.sql' into UDF test base
Hyukjin Kwon created SPARK-28282: Summary: Convert and port 'inline-table.sql' into UDF test base Key: SPARK-28282 URL: https://issues.apache.org/jira/browse/SPARK-28282 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon
[jira] [Created] (SPARK-28281) Convert and port 'having.sql' into UDF test base
Hyukjin Kwon created SPARK-28281: Summary: Convert and port 'having.sql' into UDF test base Key: SPARK-28281 URL: https://issues.apache.org/jira/browse/SPARK-28281 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon
[jira] [Created] (SPARK-28279) Convert and port 'group-analysis.sql' into UDF test base
Hyukjin Kwon created SPARK-28279: Summary: Convert and port 'group-analysis.sql' into UDF test base Key: SPARK-28279 URL: https://issues.apache.org/jira/browse/SPARK-28279 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon
[jira] [Created] (SPARK-28277) Convert and port 'except.sql' into UDF test base
Hyukjin Kwon created SPARK-28277: Summary: Convert and port 'except.sql' into UDF test base Key: SPARK-28277 URL: https://issues.apache.org/jira/browse/SPARK-28277 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon
[jira] [Created] (SPARK-28276) Convert and port 'cross-join.sql' into UDF test base
Hyukjin Kwon created SPARK-28276: Summary: Convert and port 'cross-join.sql' into UDF test base Key: SPARK-28276 URL: https://issues.apache.org/jira/browse/SPARK-28276 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon
[jira] [Created] (SPARK-28278) Convert and port 'except-all.sql' into UDF test base
Hyukjin Kwon created SPARK-28278: Summary: Convert and port 'except-all.sql' into UDF test base Key: SPARK-28278 URL: https://issues.apache.org/jira/browse/SPARK-28278 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon
[jira] [Created] (SPARK-28275) Convert and port 'count.sql' into UDF test base
Hyukjin Kwon created SPARK-28275: Summary: Convert and port 'count.sql' into UDF test base Key: SPARK-28275 URL: https://issues.apache.org/jira/browse/SPARK-28275 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon
[jira] [Updated] (SPARK-28271) Convert and port 'pgSQL/aggregates_part2.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28271: - Summary: Convert and port 'pgSQL/aggregates_part2.sql' into UDF test base (was: Convert and port 'aggregates_part2.sql' into UDF test base) > Convert and port 'pgSQL/aggregates_part2.sql' into UDF test base > > > Key: SPARK-28271 > URL: https://issues.apache.org/jira/browse/SPARK-28271 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > see SPARK-27883 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28272) Convert and port 'pgSQL/aggregates_part3.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28272: - Summary: Convert and port 'pgSQL/aggregates_part3.sql' into UDF test base (was: Convert and port 'aggregates_part3.sql' into UDF test base) > Convert and port 'pgSQL/aggregates_part3.sql' into UDF test base > > > Key: SPARK-28272 > URL: https://issues.apache.org/jira/browse/SPARK-28272 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > see SPARK-27988 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28274) Convert and port 'pgSQL/window.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28274: - Summary: Convert and port 'pgSQL/window.sql' into UDF test base (was: Convert and port 'window.sql' into UDF test base) > Convert and port 'pgSQL/window.sql' into UDF test base > -- > > Key: SPARK-28274 > URL: https://issues.apache.org/jira/browse/SPARK-28274 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > See SPARK-23160 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28270) Convert and port 'pgSQL/aggregates_part1.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28270: - Summary: Convert and port 'pgSQL/aggregates_part1.sql' into UDF test base (was: Convert and port 'aggregates_part1.sql' into UDF test base) > Convert and port 'pgSQL/aggregates_part1.sql' into UDF test base > > > Key: SPARK-28270 > URL: https://issues.apache.org/jira/browse/SPARK-28270 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > see SPARK-27770 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28273) Convert and port 'pgSQL/case.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28273: - Summary: Convert and port 'pgSQL/case.sql' into UDF test base (was: Convert and port 'case.sql' into UDF test base) > Convert and port 'pgSQL/case.sql' into UDF test base > > > Key: SPARK-28273 > URL: https://issues.apache.org/jira/browse/SPARK-28273 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > See SPARK-27934 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28274) Convert and port 'window.sql' into UDF test base
Hyukjin Kwon created SPARK-28274: Summary: Convert and port 'window.sql' into UDF test base Key: SPARK-28274 URL: https://issues.apache.org/jira/browse/SPARK-28274 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon See SPARK-23160
[jira] [Created] (SPARK-28273) Convert and port 'case.sql' into UDF test base
Hyukjin Kwon created SPARK-28273: Summary: Convert and port 'case.sql' into UDF test base Key: SPARK-28273 URL: https://issues.apache.org/jira/browse/SPARK-28273 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon See SPARK-27934
[jira] [Created] (SPARK-28272) Convert and port 'aggregates_part3.sql' into UDF test base
Hyukjin Kwon created SPARK-28272: Summary: Convert and port 'aggregates_part3.sql' into UDF test base Key: SPARK-28272 URL: https://issues.apache.org/jira/browse/SPARK-28272 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon see SPARK-27988
[jira] [Updated] (SPARK-28270) Convert and port 'aggregates_part1.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28270: - Description: see SPARK-27770 > Convert and port 'aggregates_part1.sql' into UDF test base > -- > > Key: SPARK-28270 > URL: https://issues.apache.org/jira/browse/SPARK-28270 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > see SPARK-27770 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28271) Convert and port 'aggregates_part2.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28271: - Description: see SPARK-27883 > Convert and port 'aggregates_part2.sql' into UDF test base > -- > > Key: SPARK-28271 > URL: https://issues.apache.org/jira/browse/SPARK-28271 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > see SPARK-27883 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28271) Convert and port 'aggregates_part2.sql' into UDF test base
Hyukjin Kwon created SPARK-28271: Summary: Convert and port 'aggregates_part2.sql' into UDF test base Key: SPARK-28271 URL: https://issues.apache.org/jira/browse/SPARK-28271 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon
[jira] [Created] (SPARK-28270) Convert and port 'aggregates_part1.sql' into UDF test base
Hyukjin Kwon created SPARK-28270: Summary: Convert and port 'aggregates_part1.sql' into UDF test base Key: SPARK-28270 URL: https://issues.apache.org/jira/browse/SPARK-28270 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon
[jira] [Commented] (SPARK-28269) ArrowStreamPandasSerializer gets stuck
[ https://issues.apache.org/jira/browse/SPARK-28269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879975#comment-16879975 ] Hyukjin Kwon commented on SPARK-28269: -- Seems like I can't open the image. Would you be able to specify Pandas, PyArrow, Python version and provide a full reproducer if possible? It would be also better to just show error message with stacktrace (not image) > ArrowStreamPandasSerializer get stack > - > > Key: SPARK-28269 > URL: https://issues.apache.org/jira/browse/SPARK-28269 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 >Reporter: Modi Tamam >Priority: Major > Attachments: Untitled.xcf > > > I'm working with Pyspark version 2.4.3. > I have a big data frame: > * ~15M rows > * ~130 columns > * ~2.5 GB - I've converted it to a Pandas data frame, then, pickling it > (pandas_df.toPickle() ) resulted with a file of size 2.5GB. > I have some code that groups this data frame and applying a Pandas-UDF: > > {code:java} > from pyspark.sql.functions import pandas_udf, PandasUDFType > from pyspark.sql import functions as F > import pyarrow.parquet as pq > import pyarrow as pa > non_issued_patch="31.7996378000_35.2114362000" > issued_patch = "31.7995787833_35.2121463045" > @pandas_udf("patch_name string", PandasUDFType.GROUPED_MAP) > def foo(pdf): > import pandas as pd > ret_val = pd.DataFrame({'patch_name': [pdf['patch_name'].iloc[0]]}) > return ret_val > full_df=spark.read.parquet('debug-mega-patch') > df = full_df.filter(F.col("grouping_column") == issued_patch).cache() > df.groupBy("grouping_column").apply(foo).repartition(1).write.mode('overwrite').parquet('debug-df/') > > {code} > > The above code gets stacked on the ArrowStreamPandasSerializer: (on the first > line when reading batch from the reader) > > {code:java} > for batch in reader: > yield [self.arrow_to_pandas(c) for c in > pa.Table.from_batches([batch]).itercolumns()]{code} > > > -- This message was sent by 
Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28255) Upgrade dependencies with vulnerabilities
[ https://issues.apache.org/jira/browse/SPARK-28255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879973#comment-16879973 ] Hyukjin Kwon commented on SPARK-28255: -- [~bozho], so what do you suggest to upgrade? Users can use higher Hadoop version already. Py4J seems fine - please manually check what's the issue, and report it in Py4J. > Upgrade dependencies with vulnerabilities > - > > Key: SPARK-28255 > URL: https://issues.apache.org/jira/browse/SPARK-28255 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Bozhidar Bozhanov >Priority: Major > > There are severe vulnerabilities in two dependencies: > > [ERROR] hadoop-mapreduce-client-core-2.7.3.jar: CVE-2018-8029, > CVE-2016-6811[ERROR] py4j-0.10.8.1.jar: CVE-2016-5636, CVE-2008-1887 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28255) Upgrade dependencies with vulnerabilities
[ https://issues.apache.org/jira/browse/SPARK-28255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28255. -- Resolution: Invalid > Upgrade dependencies with vulnerabilities > - > > Key: SPARK-28255 > URL: https://issues.apache.org/jira/browse/SPARK-28255 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Bozhidar Bozhanov >Priority: Major > > There are severe vulnerabilities in two dependencies: > > [ERROR] hadoop-mapreduce-client-core-2.7.3.jar: CVE-2018-8029, > CVE-2016-6811[ERROR] py4j-0.10.8.1.jar: CVE-2016-5636, CVE-2008-1887 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28189) Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables
[ https://issues.apache.org/jira/browse/SPARK-28189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Chu updated SPARK-28189: - Description: Column names in general are case insensitive in Pyspark, and df.drop() in general is also case insensitive. However, when referring to an upstream table, such as from a join, e.g. {code:java} vals1 = [('Pirate', 1),('Monkey', 2),('Ninja', 3),('Spaghetti', 4)] df1 = spark.createDataFrame(vals1, ['KEY','field']) vals2 = [('Rutabaga', 1),('Pirate', 2),('Ninja', 3),('Darth Vader', 4)] df2 = spark.createDataFrame(vals2, ['KEY','CAPS']) df_joined = df1.join(df2, df1['key'] == df2['key'], "left") {code} drop will become case sensitive. e.g. {code:java} # from above, df1 consists of columns ['KEY', 'field'] # from above, df2 consists of columns ['KEY', 'CAPS'] df_joined.select(df2['key']) # will give a result df_joined.drop('caps') # will also give a result {code} however, note the following {code:java} df_joined.drop(df2['key']) # no-op df_joined.drop(df2['caps']) # no-op df_joined.drop(df2['KEY']) # will drop column as expected df_joined.drop(df2['CAPS']) # will drop column as expected {code} so in summary, using df.drop(df2['col']) doesn't align with expected case insensitivity for column names, even though functions like select, join, and dropping a column generally are case insensitive. was: Column names in general are case insensitive in Pyspark, and df.drop() in general is also case insensitive. However, when referring to an upstream table, such as from a join, e.g. {code:java} vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)] df1 = spark.createDataFrame(vals1, ['KEY','field']) vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)] df2 = spark.createDataFrame(vals2, ['KEY','CAPS']) df_joined = df1.join(df2, df1['key'] == df2['key'], "left") {code} drop will become case sensitive. e.g. 
{code:java} # from above, df1 consists of columns ['KEY', 'field'] # from above, df2 consists of columns ['KEY', 'CAPS'] df_joined.select(df2['key']) # will give a result df_joined.drop('caps') # will also give a result {code} however, note the following {code:java} df_joined.drop(df2['key']) # no-op df_joined.drop(df2['caps']) # no-op df_joined.drop(df2['KEY']) # will drop column as expected df_joined.drop(df2['CAPS']) # will drop column as expected {code} so in summary, using df.drop(df2['col']) doesn't align with expected case insensitivity for column names, even though functions like select, join, and dropping a column generally are case insensitive. > Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables > > > Key: SPARK-28189 > URL: https://issues.apache.org/jira/browse/SPARK-28189 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Luke Chu >Assignee: Tony Zhang >Priority: Minor > Fix For: 3.0.0 > > > Column names in general are case insensitive in Pyspark, and df.drop() in > general is also case insensitive. > However, when referring to an upstream table, such as from a join, e.g. > {code:java} > vals1 = [('Pirate', 1),('Monkey', 2),('Ninja', 3),('Spaghetti', 4)] > df1 = spark.createDataFrame(vals1, ['KEY','field']) > vals2 = [('Rutabaga', 1),('Pirate', 2),('Ninja', 3),('Darth Vader', 4)] > df2 = spark.createDataFrame(vals2, ['KEY','CAPS']) > df_joined = df1.join(df2, df1['key'] == df2['key'], "left") > {code} > > drop will become case sensitive. e.g. 
> {code:java} > # from above, df1 consists of columns ['KEY', 'field'] > # from above, df2 consists of columns ['KEY', 'CAPS'] > df_joined.select(df2['key']) # will give a result > df_joined.drop('caps') # will also give a result > {code} > however, note the following > {code:java} > df_joined.drop(df2['key']) # no-op > df_joined.drop(df2['caps']) # no-op > df_joined.drop(df2['KEY']) # will drop column as expected > df_joined.drop(df2['CAPS']) # will drop column as expected > {code} > > > so in summary, using df.drop(df2['col']) doesn't align with expected case > insensitivity for column names, even though functions like select, join, and > dropping a column generally are case insensitive. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
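The inconsistency the reporter describes is that {{df_joined.drop(df2['caps'])}} silently no-ops while {{df_joined.drop('caps')}} works. A minimal pure-Python sketch of the resolution behavior the reporter expects (the function name here is hypothetical; in Spark this is governed by the {{spark.sql.caseSensitive}} configuration, false by default):

```python
def resolve_column(columns, name, case_sensitive=False):
    # Hypothetical sketch: with case-insensitive resolution (Spark's
    # default), 'caps' and 'CAPS' should refer to the same column.
    if case_sensitive:
        matches = [c for c in columns if c == name]
    else:
        matches = [c for c in columns if c.lower() == name.lower()]
    if not matches:
        raise KeyError(f"no column named {name!r}")
    return matches[0]

print(resolve_column(["KEY", "CAPS"], "caps"))  # CAPS
```

Under this expectation, dropping by {{df2['caps']}} should resolve to the {{CAPS}} column just as the string form does, rather than comparing the attribute reference case-sensitively.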
[jira] [Updated] (SPARK-28266) data correctness issue: data duplication when `path` serde property is present
[ https://issues.apache.org/jira/browse/SPARK-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-28266: --- Labels: correctness (was: ) > data correctness issue: data duplication when `path` serde peroperty is > present > --- > > Key: SPARK-28266 > URL: https://issues.apache.org/jira/browse/SPARK-28266 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core >Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, > 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3 >Reporter: Ruslan Dautkhanov >Priority: Major > Labels: correctness > > Spark duplicates returned datasets when `path` serde is present in a parquet > table. > Confirmed versions affected: Spark 2.2, Spark 2.3, Spark 2.4. > Confirmed unaffected versions: Spark 2.1 and earlier (tested with Spark 1.6 > at least). > Reproducer: > {code:python} > >>> spark.sql("create table ruslan_test.test55 as select 1 as id") > DataFrame[] > >>> spark.table("ruslan_test.test55").explain() > == Physical Plan == > HiveTableScan [id#16], HiveTableRelation `ruslan_test`.`test55`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#16] > >>> spark.table("ruslan_test.test55").count() > 1 > {code} > (all is good at this point, now exist session and run in Hive for example - ) > {code:sql} > ALTER TABLE ruslan_test.test55 SET SERDEPROPERTIES ( > 'path'='hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55' ) > {code} > So LOCATION and serde `path` property would point to the same location. 
> Now see count returns two records instead of one: > {code:python} > >>> spark.table("ruslan_test.test55").count() > 2 > >>> spark.table("ruslan_test.test55").explain() > == Physical Plan == > *(1) FileScan parquet ruslan_test.test55[id#9] Batched: true, Format: > Parquet, Location: > InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, > hdfs://epsdatalake/hive..., PartitionFilters: [], PushedFilters: [], > ReadSchema: struct > >>> > {code} > Also notice that the presence of `path` serde property makes TABLE location > show up twice - > {quote} > InMemoryFileIndex[hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55, > hdfs://epsdatalake/hive..., > {quote} > We have some applications that create parquet tables in Hive with `path` > serde property > and it makes data duplicate in query results. > Hive, Impala etc and Spark version 2.1 and earlier read such tables fine, but > not Spark 2.2 and later releases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
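The doubled entry in the {{InMemoryFileIndex}} output above suggests the mechanism: the scan's root paths are assembled from both the table LOCATION and the {{path}} serde property. A hedged sketch of that suspected failure mode (function and parameter names are hypothetical, not Spark internals):

```python
def scan_paths(table_location, serde_properties):
    # Suspected mechanism (names hypothetical): the file index is
    # built from the table LOCATION plus the 'path' serde property,
    # so when both point at the same directory every file is listed
    # -- and therefore read -- twice, doubling the row count.
    paths = [table_location]
    extra = serde_properties.get("path")
    if extra is not None:
        paths.append(extra)
    return paths

loc = "hdfs://epsdatalake/hivewarehouse/ruslan_test.db/test55"
print(scan_paths(loc, {"path": loc}))  # the same directory, twice
```

If this is indeed the cause, deduplicating the resolved root paths before listing files would be one obvious direction for a fix.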
[jira] [Updated] (SPARK-28269) ArrowStreamPandasSerializer gets stuck
[ https://issues.apache.org/jira/browse/SPARK-28269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Modi Tamam updated SPARK-28269: --- Attachment: Untitled.xcf > ArrowStreamPandasSerializer get stack > - > > Key: SPARK-28269 > URL: https://issues.apache.org/jira/browse/SPARK-28269 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 >Reporter: Modi Tamam >Priority: Major > Attachments: Untitled.xcf > > > I'm working with Pyspark version 2.4.3. > I have a big data frame: > * ~15M rows > * ~130 columns > * ~2.5 GB - I've converted it to a Pandas data frame, then, pickling it > (pandas_df.toPickle() ) resulted with a file of size 2.5GB. > I have some code that groups this data frame and applying a Pandas-UDF: > > {code:java} > from pyspark.sql.functions import pandas_udf, PandasUDFType > from pyspark.sql import functions as F > import pyarrow.parquet as pq > import pyarrow as pa > non_issued_patch="31.7996378000_35.2114362000" > issued_patch = "31.7995787833_35.2121463045" > @pandas_udf("patch_name string", PandasUDFType.GROUPED_MAP) > def foo(pdf): > import pandas as pd > ret_val = pd.DataFrame({'patch_name': [pdf['patch_name'].iloc[0]]}) > return ret_val > full_df=spark.read.parquet('debug-mega-patch') > df = full_df.filter(F.col("grouping_column") == issued_patch).cache() > df.groupBy("grouping_column").apply(foo).repartition(1).write.mode('overwrite').parquet('debug-df/') > > {code} > > The above code gets stacked on the ArrowStreamPandasSerializer: (on the first > line when reading batch from the reader) > > {code:java} > for batch in reader: > yield [self.arrow_to_pandas(c) for c in > pa.Table.from_batches([batch]).itercolumns()]{code} > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28269) ArrowStreamPandasSerializer get stack
Modi Tamam created SPARK-28269: -- Summary: ArrowStreamPandasSerializer get stack Key: SPARK-28269 URL: https://issues.apache.org/jira/browse/SPARK-28269 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.3 Reporter: Modi Tamam I'm working with PySpark version 2.4.3. I have a big data frame: * ~15M rows * ~130 columns * ~2.5 GB - I've converted it to a Pandas data frame; pickling it (pandas_df.to_pickle()) resulted in a 2.5 GB file. I have some code that groups this data frame and applies a Pandas UDF: {code:python} from pyspark.sql.functions import pandas_udf, PandasUDFType from pyspark.sql import functions as F import pyarrow.parquet as pq import pyarrow as pa non_issued_patch="31.7996378000_35.2114362000" issued_patch = "31.7995787833_35.2121463045" @pandas_udf("patch_name string", PandasUDFType.GROUPED_MAP) def foo(pdf): import pandas as pd ret_val = pd.DataFrame({'patch_name': [pdf['patch_name'].iloc[0]]}) return ret_val full_df=spark.read.parquet('debug-mega-patch') df = full_df.filter(F.col("grouping_column") == issued_patch).cache() df.groupBy("grouping_column").apply(foo).repartition(1).write.mode('overwrite').parquet('debug-df/') {code} The above code gets stuck in the ArrowStreamPandasSerializer (on the first line, when reading a batch from the reader): {code:python} for batch in reader: yield [self.arrow_to_pandas(c) for c in pa.Table.from_batches([batch]).itercolumns()]{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24077) Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS`
[ https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879858#comment-16879858 ] HondaWei commented on SPARK-24077: -- Hi [~hyukjin.kwon], thank you! I am going to trace the code and modify it in the near term if [~benedict jin] doesn't work on it. > Issue a better error message for `CREATE TEMPORARY FUNCTION IF NOT EXISTS` > -- > > Key: SPARK-24077 > URL: https://issues.apache.org/jira/browse/SPARK-24077 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Benedict Jin >Priority: Major > Labels: starter > > The error message of {{CREATE TEMPORARY FUNCTION IF NOT EXISTS}} looks > confusing: > {code} > scala> > org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE > TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as > 'org.apache.spark.sql.hive.udf.YuZhouWan'") > {code} > {code} > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'NOT' expecting {'.', 'AS'} (line 1, pos 29) > == SQL == > CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as > 'org.apache.spark.sql.hive.udf.YuZhouWan' > -^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
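To illustrate the kind of fix the ticket asks for, here is a hypothetical pre-parse check in Python (not the actual Spark parser; the function name and message are invented for illustration): instead of a raw ANTLR "mismatched input" dump, the statement is recognized and rejected with a readable message.

```python
import re

def check_create_function(sql: str) -> None:
    # Hypothetical check: TEMPORARY functions do not support IF NOT EXISTS,
    # so raise a targeted error rather than a generic parse failure.
    if re.search(r"\bCREATE\s+TEMPORARY\s+FUNCTION\s+IF\s+NOT\s+EXISTS\b",
                 sql, re.IGNORECASE):
        raise ValueError(
            "IF NOT EXISTS is not supported for CREATE TEMPORARY FUNCTION")

# Permanent functions may combine CREATE FUNCTION with IF NOT EXISTS:
check_create_function("CREATE FUNCTION IF NOT EXISTS f AS 'com.example.F'")

# Temporary functions get a readable error instead of a parser dump:
try:
    check_create_function(
        "CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan AS 'x.YuZhouWan'")
except ValueError as e:
    print(e)  # IF NOT EXISTS is not supported for CREATE TEMPORARY FUNCTION
```

The real fix would live in Spark's SQL grammar or in `SparkSqlParser`; the sketch only shows the shape of the improved diagnostic.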
[jira] [Assigned] (SPARK-28268) Rewrite non-correlated Semi/Anti join as Filter
[ https://issues.apache.org/jira/browse/SPARK-28268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28268: Assignee: (was: Apache Spark) > Rewrite non-correlated Semi/Anti join as Filter > --- > > Key: SPARK-28268 > URL: https://issues.apache.org/jira/browse/SPARK-28268 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mingcong Han >Priority: Major > > When a semi/anti join has a non-correlated join condition, we can convert it to > a Filter with a non-correlated Exists subquery. As the Exists subquery is > non-correlated, we can use a physical plan for it and avoid the join. > Actually, this optimization is mainly for non-correlated subqueries > (Exists/In). We currently rewrite Exists/InSubquery as a semi/anti/existential > join, whether it is correlated or not, and they are mostly executed using a > BroadcastNestedLoopJoin, which is really not a good choice. > Here are some examples: > 1. > {code:sql} > SELECT t1a > FROM t1 > SEMI JOIN t2 > ON t2a > 10 OR t2b = 'a' > {code} > => > {code:sql} > SELECT t1a > FROM t1 > WHERE EXISTS(SELECT 1 > FROM t2 > WHERE t2a > 10 OR t2b = 'a') > {code} > 2. > {code:sql} > SELECT t1a > FROM t1 > ANTI JOIN t2 > ON t1b > 10 AND t2b = 'b' > {code} > => > {code:sql} > SELECT t1a > FROM t1 > WHERE NOT(t1b > 10 > AND EXISTS(SELECT 1 > FROM t2 > WHERE t2b = 'b')) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28268) Rewrite non-correlated Semi/Anti join as Filter
[ https://issues.apache.org/jira/browse/SPARK-28268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28268: Assignee: Apache Spark > Rewrite non-correlated Semi/Anti join as Filter > --- > > Key: SPARK-28268 > URL: https://issues.apache.org/jira/browse/SPARK-28268 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mingcong Han >Assignee: Apache Spark >Priority: Major > > When a semi/anti join has a non-correlated join condition, we can convert it to > a Filter with a non-correlated Exists subquery. As the Exists subquery is > non-correlated, we can use a physical plan for it and avoid the join. > Actually, this optimization is mainly for non-correlated subqueries > (Exists/In). We currently rewrite Exists/InSubquery as a semi/anti/existential > join, whether it is correlated or not, and they are mostly executed using a > BroadcastNestedLoopJoin, which is really not a good choice. > Here are some examples: > 1. > {code:sql} > SELECT t1a > FROM t1 > SEMI JOIN t2 > ON t2a > 10 OR t2b = 'a' > {code} > => > {code:sql} > SELECT t1a > FROM t1 > WHERE EXISTS(SELECT 1 > FROM t2 > WHERE t2a > 10 OR t2b = 'a') > {code} > 2. > {code:sql} > SELECT t1a > FROM t1 > ANTI JOIN t2 > ON t1b > 10 AND t2b = 'b' > {code} > => > {code:sql} > SELECT t1a > FROM t1 > WHERE NOT(t1b > 10 > AND EXISTS(SELECT 1 > FROM t2 > WHERE t2b = 'b')) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28268) Rewrite non-correlated Semi/Anti join as Filter
Mingcong Han created SPARK-28268: Summary: Rewrite non-correlated Semi/Anti join as Filter Key: SPARK-28268 URL: https://issues.apache.org/jira/browse/SPARK-28268 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Mingcong Han When a semi/anti join has a non-correlated join condition, we can convert it to a Filter with a non-correlated Exists subquery. As the Exists subquery is non-correlated, we can use a physical plan for it and avoid the join. Actually, this optimization is mainly for non-correlated subqueries (Exists/In). We currently rewrite Exists/InSubquery as a semi/anti/existential join, whether it is correlated or not, and they are mostly executed using a BroadcastNestedLoopJoin, which is really not a good choice. Here are some examples: 1. {code:sql} SELECT t1a FROM t1 SEMI JOIN t2 ON t2a > 10 OR t2b = 'a' {code} => {code:sql} SELECT t1a FROM t1 WHERE EXISTS(SELECT 1 FROM t2 WHERE t2a > 10 OR t2b = 'a') {code} 2. {code:sql} SELECT t1a FROM t1 ANTI JOIN t2 ON t1b > 10 AND t2b = 'b' {code} => {code:sql} SELECT t1a FROM t1 WHERE NOT(t1b > 10 AND EXISTS(SELECT 1 FROM t2 WHERE t2b = 'b')) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
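The proposed rewrite can be sketched in plain Python (this is not Spark's optimizer code; the tables and values are made up for illustration): because the join condition does not reference t1's columns in the EXISTS part, it can be evaluated once against t2 and the join collapses into a constant filter over t1.

```python
# Illustrative rows for two small tables.
t1 = [{"t1a": 1, "t1b": 5}, {"t1a": 2, "t1b": 20}]
t2 = [{"t2a": 3, "t2b": "a"}, {"t2a": 15, "t2b": "c"}]

# Example 1: SEMI JOIN ON t2a > 10 OR t2b = 'a'
# => WHERE EXISTS(SELECT 1 FROM t2 WHERE t2a > 10 OR t2b = 'a')
exists = any(r["t2a"] > 10 or r["t2b"] == "a" for r in t2)  # evaluated once
semi = [r["t1a"] for r in t1 if exists]
assert semi == [1, 2]  # the condition holds for t2b = 'a', so all t1 rows pass

# Example 2: ANTI JOIN ON t1b > 10 AND t2b = 'b'
# => WHERE NOT(t1b > 10 AND EXISTS(SELECT 1 FROM t2 WHERE t2b = 'b'))
exists_b = any(r["t2b"] == "b" for r in t2)                 # evaluated once
anti = [r["t1a"] for r in t1 if not (r["t1b"] > 10 and exists_b)]
assert anti == [1, 2]  # no t2 row has t2b = 'b', so no t1 row is excluded
```

Note how the inner predicate runs once over t2 in both cases, instead of once per t1 row as a BroadcastNestedLoopJoin would do.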
[jira] [Assigned] (SPARK-28267) Update building-spark.md
[ https://issues.apache.org/jira/browse/SPARK-28267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28267: Assignee: Apache Spark > Update building-spark.md > > > Key: SPARK-28267 > URL: https://issues.apache.org/jira/browse/SPARK-28267 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28267) Update building-spark.md
[ https://issues.apache.org/jira/browse/SPARK-28267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28267: Issue Type: Sub-task (was: Improvement) Parent: SPARK-23710 > Update building-spark.md > > > Key: SPARK-28267 > URL: https://issues.apache.org/jira/browse/SPARK-28267 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28267) Update building-spark.md
[ https://issues.apache.org/jira/browse/SPARK-28267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28267: Assignee: (was: Apache Spark) > Update building-spark.md > > > Key: SPARK-28267 > URL: https://issues.apache.org/jira/browse/SPARK-28267 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org