[jira] [Commented] (SPARK-32481) Support truncate table to move the data to trash
[ https://issues.apache.org/jira/browse/SPARK-32481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237224#comment-17237224 ] Hyukjin Kwon commented on SPARK-32481: -- Reverted in https://github.com/apache/spark/pull/30463 > Support truncate table to move the data to trash > > > Key: SPARK-32481 > URL: https://issues.apache.org/jira/browse/SPARK-32481 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL > Affects Versions: 3.1.0 > Reporter: jobit mathew > Assignee: Udbhav Agrawal > Priority: Minor > Fix For: 3.1.0 > > > *Instead of deleting the data, move it to trash. Data in trash can then be deleted permanently, based on configuration.* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32481) Support truncate table to move the data to trash
[ https://issues.apache.org/jira/browse/SPARK-32481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32481: - Fix Version/s: (was: 3.1.0)
[jira] [Reopened] (SPARK-32481) Support truncate table to move the data to trash
[ https://issues.apache.org/jira/browse/SPARK-32481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-32481: -- Assignee: (was: Udbhav Agrawal)
[jira] [Assigned] (SPARK-32481) Support truncate table to move the data to trash
[ https://issues.apache.org/jira/browse/SPARK-32481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32481: Assignee: Apache Spark
[jira] [Assigned] (SPARK-32481) Support truncate table to move the data to trash
[ https://issues.apache.org/jira/browse/SPARK-32481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32481: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-33515) Improve exception messages while handling UnresolvedTable
[ https://issues.apache.org/jira/browse/SPARK-33515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33515: --- Assignee: Terry Kim > Improve exception messages while handling UnresolvedTable > - > > Key: SPARK-33515 > URL: https://issues.apache.org/jira/browse/SPARK-33515 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 3.1.0 > Reporter: Terry Kim > Assignee: Terry Kim > Priority: Minor > > Improve exception messages while handling UnresolvedTable by adding the command name.
[jira] [Resolved] (SPARK-33515) Improve exception messages while handling UnresolvedTable
[ https://issues.apache.org/jira/browse/SPARK-33515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33515. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30461 [https://github.com/apache/spark/pull/30461]
[jira] [Resolved] (SPARK-33511) Respect case sensitivity in resolving partition specs V2
[ https://issues.apache.org/jira/browse/SPARK-33511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33511. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30454 [https://github.com/apache/spark/pull/30454] > Respect case sensitivity in resolving partition specs V2 > > > Key: SPARK-33511 > URL: https://issues.apache.org/jira/browse/SPARK-33511 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 3.1.0 > Reporter: Maxim Gekk > Assignee: Maxim Gekk > Priority: Major > Fix For: 3.1.0 > > > DSv1 DDL commands respect the SQL config spark.sql.caseSensitive, for example: > {code:sql} > spark-sql> CREATE TABLE tbl1 (id bigint, data string) USING parquet PARTITIONED BY (id); > spark-sql> ALTER TABLE tbl1 ADD PARTITION (ID=1); > spark-sql> SHOW PARTITIONS tbl1; > id=1 > {code} > but the same ALTER TABLE command fails on DSv2.
[jira] [Assigned] (SPARK-33511) Respect case sensitivity in resolving partition specs V2
[ https://issues.apache.org/jira/browse/SPARK-33511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33511: --- Assignee: Maxim Gekk
[jira] [Updated] (SPARK-33517) Incorrect menu item display and link in PySpark Usage Guide for Pandas with Apache Arrow
[ https://issues.apache.org/jira/browse/SPARK-33517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liucht-inspur updated SPARK-33517: -- Description: The menu item and its link are set incorrectly; change "Apache Arrow in Spark" to "Apache Arrow in PySpark" !image-2020-11-23-18-47-01-591.png! was: The menu item and its link are set incorrectly; change "Apache Arrow in Spark" to "Apache Arrow in PySpark" > Incorrect menu item display and link in PySpark Usage Guide for Pandas with Apache Arrow > > > Key: SPARK-33517 > URL: https://issues.apache.org/jira/browse/SPARK-33517 > Project: Spark > Issue Type: Bug > Components: docs > Affects Versions: 3.0.0, 3.0.1 > Reporter: liucht-inspur > Priority: Minor > Attachments: image-2020-11-23-18-47-01-591.png, spark-doc.jpg > > > The menu item and its link are set incorrectly; change "Apache Arrow in Spark" to "Apache Arrow in PySpark" > !image-2020-11-23-18-47-01-591.png!
[jira] [Updated] (SPARK-33517) Incorrect menu item display and link in PySpark Usage Guide for Pandas with Apache Arrow
[ https://issues.apache.org/jira/browse/SPARK-33517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liucht-inspur updated SPARK-33517: -- Attachment: image-2020-11-23-18-47-01-591.png
[jira] [Created] (SPARK-33518) Improve performance of ML ALS recommendForAll by GEMV
zhengruifeng created SPARK-33518: Summary: Improve performance of ML ALS recommendForAll by GEMV Key: SPARK-33518 URL: https://issues.apache.org/jira/browse/SPARK-33518 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.2.0 Reporter: zhengruifeng There has been a lot of work on improving ALS's {{recommendForAll}}. It may be further optimized by: 1. using GEMV; 2. aggregating directly on top-K collections (srcId, Array(dstId), Array(score)) instead of on each element (srcId, (dstId, score)); 3. using guava Ordering instead of BoundedPriorityQueue.
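The per-source top-K aggregation of point 2 can be sketched in plain Python (the function name and data layout below are hypothetical, chosen only to mirror the (srcId, Array(dstId), Array(score)) shape described above; this is an illustration, not Spark's actual ALS code, and it uses a bounded heap rather than guava Ordering):

```python
import heapq
from collections import defaultdict

def top_k_per_src(elements, k):
    """Aggregate (srcId, (dstId, score)) elements into per-src top-k
    collections (srcId, ([dstId...], [score...])).
    A bounded heap of size k is kept per srcId, so memory stays
    O(number_of_src_ids * k) instead of holding every element."""
    heaps = defaultdict(list)
    for src, (dst, score) in elements:
        h = heaps[src]
        if len(h) < k:
            heapq.heappush(h, (score, dst))
        elif score > h[0][0]:
            # New element beats the current k-th best: replace it.
            heapq.heapreplace(h, (score, dst))
    out = {}
    for src, h in heaps.items():
        ranked = sorted(h, reverse=True)  # best score first
        out[src] = ([d for _, d in ranked], [s for s, _ in ranked])
    return out
```

The point of aggregating whole top-K collections rather than individual (dstId, score) pairs is that the shuffle carries one compact record per srcId instead of one record per candidate.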
[jira] [Assigned] (SPARK-33518) Improve performance of ML ALS recommendForAll by GEMV
[ https://issues.apache.org/jira/browse/SPARK-33518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33518: Assignee: Apache Spark
[jira] [Commented] (SPARK-33518) Improve performance of ML ALS recommendForAll by GEMV
[ https://issues.apache.org/jira/browse/SPARK-33518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237279#comment-17237279 ] Apache Spark commented on SPARK-33518: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/30468
[jira] [Assigned] (SPARK-33518) Improve performance of ML ALS recommendForAll by GEMV
[ https://issues.apache.org/jira/browse/SPARK-33518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33518: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-33518) Improve performance of ML ALS recommendForAll by GEMV
[ https://issues.apache.org/jira/browse/SPARK-33518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237280#comment-17237280 ] Apache Spark commented on SPARK-33518: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/30468
[jira] [Commented] (SPARK-33479) Make apiKey of docsearch configurable
[ https://issues.apache.org/jira/browse/SPARK-33479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237318#comment-17237318 ] Apache Spark commented on SPARK-33479: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/30469 > Make apiKey of docsearch configurable > - > > Key: SPARK-33479 > URL: https://issues.apache.org/jira/browse/SPARK-33479 > Project: Spark > Issue Type: Improvement > Components: Documentation > Affects Versions: 3.1.0 > Reporter: Gengliang Wang > Assignee: Gengliang Wang > Priority: Minor > Fix For: 3.1.0 > > > After https://github.com/apache/spark/pull/30292, our Spark documentation site supports searching. > However, the default API key always points to the latest release doc. We have to set different API keys for different releases; otherwise, the search results are always based on the latest documentation (https://spark.apache.org/docs/latest/) even when visiting the documentation of previous releases. > As per the discussion in https://github.com/apache/spark/pull/30292#issuecomment-725613417, we should make the API key configurable and avoid hardcoding it in the HTML template.
[jira] [Commented] (SPARK-33479) Make apiKey of docsearch configurable
[ https://issues.apache.org/jira/browse/SPARK-33479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237319#comment-17237319 ] Apache Spark commented on SPARK-33479: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/30469
[jira] [Updated] (SPARK-32792) Improve in filter pushdown for ParquetFilters
[ https://issues.apache.org/jira/browse/SPARK-32792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-32792: Description: Support pushing down `GreaterThanOrEqual` on the minimum value and `LessThanOrEqual` on the maximum value when the number of IN values exceeds `spark.sql.parquet.pushdown.inFilterThreshold`. For example: {code:sql} SELECT * FROM t WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15) {code} We will push down `id >= 1 and id <= 15`. was: [https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L602] {code:scala} case sources.In(name, values) if canMakeFilterOn(name, values.head) && values.distinct.length <= pushDownInFilterThreshold => values.distinct.flatMap { v => {code} *distinct* is expensive > Improve in filter pushdown for ParquetFilters > - > > Key: SPARK-32792 > URL: https://issues.apache.org/jira/browse/SPARK-32792 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.1.0 > Reporter: Yuming Wang > Priority: Major > > Support pushing down `GreaterThanOrEqual` on the minimum value and `LessThanOrEqual` on the maximum value when the number of IN values exceeds `spark.sql.parquet.pushdown.inFilterThreshold`. For example: > {code:sql} > SELECT * FROM t WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15) > {code} > We will push down `id >= 1 and id <= 15`.
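The proposed rewrite can be sketched in plain Python (the function name and string output are hypothetical, purely to illustrate the min/max fallback described above; the real change would build Parquet filter predicates, not SQL strings):

```python
def rewrite_in_filter(column, values, threshold):
    """If an IN list has more distinct values than the pushdown threshold,
    fall back to a range predicate on the min and max values, so a filter
    can still be pushed down to Parquet instead of being dropped."""
    distinct = sorted(set(values))
    if len(distinct) <= threshold:
        # Small list: push the IN filter down as-is.
        return f"{column} IN ({', '.join(map(str, distinct))})"
    # Large list: push down only the enclosing range.
    return f"{column} >= {distinct[0]} and {column} <= {distinct[-1]}"
```

With the example from the description, thirteen distinct values against a threshold of ten yields the range predicate `id >= 1 and id <= 15`, which lets Parquet skip row groups whose min/max statistics fall outside that range.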
[jira] [Created] (SPARK-33519) Batch UDF in scala
Gaetan created SPARK-33519: -- Summary: Batch UDF in scala Key: SPARK-33519 URL: https://issues.apache.org/jira/browse/SPARK-33519 Project: Spark Issue Type: Wish Components: Optimizer, Spark Core Affects Versions: 3.0.1, 3.0.0 Reporter: Gaetan Hello, Contrary to Python, there is only one type of Scala UDF, which lets us define a Scala function to apply to a set of Columns and which is called +for each row+. One advantage of a Scala UDF over mapPartitions is that Catalyst can see what the inputs are, which is then used for column pruning, predicate pushdown and other optimization rules. But in some use cases there is a setup phase that we only want to execute once per worker, right before processing inputs. For such use cases, a Scala UDF is not well suited and mapPartitions is used instead, like this: {code:scala} ds.mapPartitions( it => { setup() process(it) } ){code} After having looked at the code, I figured that Python UDFs are implemented via query plans that retrieve an RDD via their children and call mapPartitions of that RDD to work with batches of inputs. These query plans are generated by Catalyst by extracting Python UDFs (rule ExtractPythonUDFs). As for Python UDFs, we could implement Scala batch UDFs with query plans that work with a batch of inputs instead of one input. What do you think? Here is a very short description of one of our use cases of Spark that could greatly benefit from Scala batch UDFs: We are using Spark to distribute some computation run in C#. To do so, we call the method mapPartitions of the DataFrame that represents our data. Inside mapPartitions, we: * First connect to the C# process * Then iterate over the inputs, sending each input to the C# process and getting back the results. The use of mapPartitions was motivated by the setup (connection to the C# process) that happens once for each partition. Now that we have a first working version, we would like to improve it by limiting the columns to read.
We don't want to select the columns required by our computation right before the mapPartitions, because that would filter out columns that could be required by other transformations in the workflow. Instead, we would like to take advantage of Catalyst for column pruning, predicate pushdown and other optimization rules. Using a Scala UDF to replace the mapPartitions would not be efficient, because we would connect to the C# process for each row. An alternative would be a Scala "batch" UDF, applied to the columns needed for our computation so as to take advantage of Catalyst and its optimization rules, and whose input would be an iterator, like mapPartitions.
[jira] [Commented] (SPARK-33495) jcl-over-slf4j conflicts with commons-logging.jar
[ https://issues.apache.org/jira/browse/SPARK-33495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237402#comment-17237402 ] Apache Spark commented on SPARK-33495: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/30470 > jcl-over-slf4j conflicts with commons-logging.jar > - > > Key: SPARK-33495 > URL: https://issues.apache.org/jira/browse/SPARK-33495 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL > Affects Versions: 2.4.7, 3.0.0, 3.0.1 > Reporter: lrz > Priority: Minor > > Spark introduced jcl-over-slf4j as the bridge between commons-logging and slf4j; refer to: > [https://jira.qos.ch/browse/SLF4J-250?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel] > [http://www.slf4j.org/legacy.html] > Because jcl-over-slf4j.jar contains duplicate classes with commons-logging.jar, it is better to remove the dependency on commons-logging. > We also found a deadlock issue caused by the coexistence of jcl-over-slf4j.jar and commons-logging.jar; it happens during the class initialization of LogFactory and SLF4JLogFactory.
[jira] [Assigned] (SPARK-33495) jcl-over-slf4j conflicts with commons-logging.jar
[ https://issues.apache.org/jira/browse/SPARK-33495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33495: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-33495) jcl-over-slf4j conflicts with commons-logging.jar
[ https://issues.apache.org/jira/browse/SPARK-33495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33495: Assignee: Apache Spark
[jira] [Commented] (SPARK-33495) jcl-over-slf4j conflicts with commons-logging.jar
[ https://issues.apache.org/jira/browse/SPARK-33495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237403#comment-17237403 ] Apache Spark commented on SPARK-33495: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/30470
[jira] [Assigned] (SPARK-32221) Avoid possible errors due to incorrect file size or type supplied in spark conf.
[ https://issues.apache.org/jira/browse/SPARK-32221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32221: Assignee: Apache Spark > Avoid possible errors due to incorrect file size or type supplied in spark conf. > > > Key: SPARK-32221 > URL: https://issues.apache.org/jira/browse/SPARK-32221 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes > Affects Versions: 3.1.0 > Reporter: Prashant Sharma > Assignee: Apache Spark > Priority: Major > > This would avoid failures in case the files are a bit large or a user places a binary file inside the SPARK_CONF_DIR, neither of which is supported at the moment. > The reason is that the underlying etcd store limits the size of each entry to only 1 MiB (recent versions of Kubernetes have moved to etcd 3.4.x, which allows a 1.5 MiB limit). Once etcd is upgraded in all the popular Kubernetes clusters, we can hope to overcome this limitation; see [https://etcd.io/docs/v3.4.0/dev-guide/limit/] for how etcd can allow a higher limit on each entry. > Even if that does not happen, there are other ways to overcome this limitation; for example, config files could be split across multiple ConfigMaps. Those need to be discussed and prioritised; this issue takes the straightforward approach of skipping files that cannot be accommodated within the 1.5 MiB limit and warning the user about it.
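The skip-and-warn behaviour described above might look roughly like this (a minimal Python sketch; the function name and the exact limit constant are assumptions, not the actual Spark Kubernetes code, which is written in Scala):

```python
import logging
import os

logger = logging.getLogger(__name__)

# Per-entry limit of the etcd store backing ConfigMaps, as discussed above.
MAX_ENTRY_BYTES = int(1.5 * 1024 * 1024)

def selectable_conf_files(paths, max_bytes=MAX_ENTRY_BYTES):
    """Keep only the config files that fit in a single etcd entry;
    warn about, and skip, files that would exceed the limit."""
    kept = []
    for p in paths:
        size = os.path.getsize(p)
        if size > max_bytes:
            logger.warning("Skipping %s (%d bytes): exceeds the %d-byte "
                           "ConfigMap entry limit", p, size, max_bytes)
        else:
            kept.append(p)
    return kept
```

Skipping with a warning keeps submission working for the common case while making the unsupported oversized-file case visible to the user instead of failing opaquely inside the API server.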
[jira] [Commented] (SPARK-32221) Avoid possible errors due to incorrect file size or type supplied in spark conf.
[ https://issues.apache.org/jira/browse/SPARK-32221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237448#comment-17237448 ] Apache Spark commented on SPARK-32221: -- User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/30472
[jira] [Assigned] (SPARK-32221) Avoid possible errors due to incorrect file size or type supplied in spark conf.
[ https://issues.apache.org/jira/browse/SPARK-32221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32221: Assignee: (was: Apache Spark) > Avoid possible errors due to incorrect file size or type supplied in spark > conf. > > > Key: SPARK-32221 > URL: https://issues.apache.org/jira/browse/SPARK-32221 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Priority: Major > > This would avoid failures, in case the files are a bit large or a user places > a binary file inside the SPARK_CONF_DIR. > Both of which are not supported at the moment. > The reason is, underlying etcd store does limit the size of each entry to > only 1 MiB( Recent versions of K8s have moved to using 3.4.x of etcd which > allows for 1.5MiB limit). Once etcd is upgraded in all the popular k8s > clusters, then we can hope to overcome this limitation. e.g. > [https://etcd.io/docs/v3.4.0/dev-guide/limit/] version of etcd allows for > higher limit on each entry. > Even if that does not happen, there are other ways to overcome this > limitation, for example, we can have config files split across multiple > configMaps. We need to discuss, and prioritise, this issue takes the > straightforward approach of skipping files that cannot be accommodated within > 1.5MiB limit and WARNING the user about the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32221) Avoid possible errors due to incorrect file size or type supplied in spark conf.
[ https://issues.apache.org/jira/browse/SPARK-32221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237449#comment-17237449 ] Apache Spark commented on SPARK-32221: -- User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/30472 > Avoid possible errors due to incorrect file size or type supplied in spark > conf. > > > Key: SPARK-32221 > URL: https://issues.apache.org/jira/browse/SPARK-32221 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Priority: Major > > This would avoid failures, in case the files are a bit large or a user places > a binary file inside the SPARK_CONF_DIR. > Both of which are not supported at the moment. > The reason is, underlying etcd store does limit the size of each entry to > only 1 MiB( Recent versions of K8s have moved to using 3.4.x of etcd which > allows for 1.5MiB limit). Once etcd is upgraded in all the popular k8s > clusters, then we can hope to overcome this limitation. e.g. > [https://etcd.io/docs/v3.4.0/dev-guide/limit/] version of etcd allows for > higher limit on each entry. > Even if that does not happen, there are other ways to overcome this > limitation, for example, we can have config files split across multiple > configMaps. We need to discuss, and prioritise, this issue takes the > straightforward approach of skipping files that cannot be accommodated within > 1.5MiB limit and WARNING the user about the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
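The approach described in the issue — skip any conf file that cannot fit in a ConfigMap entry and warn the user — can be sketched as follows. This is a standalone illustration, not Spark's actual code: the object and method names are invented, the 1.5 MiB figure is the etcd 3.4.x entry limit mentioned above, and the NUL-byte test is a crude stand-in for a real binary-file check.

{code:scala}
import java.io.File
import java.nio.file.Files

object ConfFilter {
  // etcd entry limit in recent Kubernetes clusters (etcd 3.4.x): 1.5 MiB
  val MaxEntryBytes: Long = (1.5 * 1024 * 1024).toLong

  /** Return the conf files that fit in a single ConfigMap entry,
    * warning about the ones that are skipped. Assumes confDir exists. */
  def loadable(confDir: File): Seq[File] =
    confDir.listFiles().toSeq.filter { f =>
      val ok = f.length() <= MaxEntryBytes && isText(f)
      if (!ok) Console.err.println(
        s"WARNING: skipping ${f.getName}: too large or binary for a ConfigMap entry")
      ok
    }

  // Crude binary check: a text file should not contain NUL bytes.
  private def isText(f: File): Boolean =
    !Files.readAllBytes(f.toPath).contains(0.toByte)
}
{code}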
[jira] [Created] (SPARK-33520) make CrossValidator/TrainValidateSplit support Python backend estimator/model
Weichen Xu created SPARK-33520: -- Summary: make CrossValidator/TrainValidateSplit support Python backend estimator/model Key: SPARK-33520 URL: https://issues.apache.org/jira/browse/SPARK-33520 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 3.1.0 Reporter: Weichen Xu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33520) make CrossValidator/TrainValidateSplit support Python backend estimator/model
[ https://issues.apache.org/jira/browse/SPARK-33520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33520: Assignee: Apache Spark > make CrossValidator/TrainValidateSplit support Python backend estimator/model > - > > Key: SPARK-33520 > URL: https://issues.apache.org/jira/browse/SPARK-33520 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Weichen Xu >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33520) make CrossValidator/TrainValidateSplit support Python backend estimator/model
[ https://issues.apache.org/jira/browse/SPARK-33520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237455#comment-17237455 ] Apache Spark commented on SPARK-33520: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/30471 > make CrossValidator/TrainValidateSplit support Python backend estimator/model > - > > Key: SPARK-33520 > URL: https://issues.apache.org/jira/browse/SPARK-33520 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33520) make CrossValidator/TrainValidateSplit support Python backend estimator/model
[ https://issues.apache.org/jira/browse/SPARK-33520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33520: Assignee: (was: Apache Spark) > make CrossValidator/TrainValidateSplit support Python backend estimator/model > - > > Key: SPARK-33520 > URL: https://issues.apache.org/jira/browse/SPARK-33520 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33519) Batch UDF in scala
[ https://issues.apache.org/jira/browse/SPARK-33519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gaetan updated SPARK-33519: --- Issue Type: New Feature (was: Wish) > Batch UDF in scala > -- > > Key: SPARK-33519 > URL: https://issues.apache.org/jira/browse/SPARK-33519 > Project: Spark > Issue Type: New Feature > Components: Optimizer, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Gaetan >Priority: Major > > Hello, > Contrary to Python, there is only one type of Scala UDF: it lets us define a Scala function to apply to a set of Columns, and it is called +for each row+. One advantage of a Scala UDF over mapPartitions is that Catalyst can see what the inputs are, which it then uses for column pruning, predicate pushdown and other optimization rules. But in some use cases there is a setup phase that we only want to execute once per worker, right before processing inputs. For such use cases a Scala UDF is not well suited, and mapPartitions is used instead, like this: > > {code:java} > ds.mapPartitions( > it => { > setup() > process(it) > } > ){code} > After having looked at the code, I figured out that Python UDFs are implemented via query plans that retrieve an RDD via their children and call mapPartitions on that RDD to work with batches of inputs. These query plans are generated by Catalyst by extracting Python UDFs (rule ExtractPythonUDFs). > > Like Python UDFs, we could implement Scala batch UDFs with query plans that work with a batch of inputs instead of one input. What do you think? > Here is a very short description of one of our use cases of Spark that could greatly benefit from Scala batch UDFs: > We are using Spark to distribute some computation run in C#. To do so, we call the method mapPartitions of the DataFrame that represents our data. 
> Inside mapPartitions, we: > * First connect to the C# process; > * Then iterate over the inputs, sending each input to the C# process and getting back the results. > The use of mapPartitions was motivated by the setup (connection to the C# process) that happens once per partition. > Now that we have a first working version, we would like to improve it by limiting the columns read. We don't want to select the columns required by our computation right before the mapPartitions, because that would filter out columns that could be required by other transformations in the workflow. Instead, we would like to take advantage of Catalyst for column pruning, predicate pushdowns and other optimization rules. > Using a Scala UDF to replace the mapPartitions would not be efficient, because we would connect to the C# process for each row. An alternative would be a Scala "batch" UDF, applied to the columns needed by our computation so as to take advantage of Catalyst and its optimization rules, and whose input would be an iterator, like mapPartitions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
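The setup-once-per-partition pattern quoted in this issue can be demonstrated outside Spark on a plain Iterator. This is only a sketch of the pattern: `processPartition`, the `setup` closure, and the doubling function are stand-ins for mapPartitions, the C# connection, and the real computation.

{code:scala}
// Mimics ds.mapPartitions { it => setup(); process(it) } on one "partition":
// the setup closure runs once, then every element flows through the
// function it produced.
def processPartition[A, B](setup: () => A => B)(rows: Iterator[A]): Iterator[B] = {
  val send = setup()   // e.g. connect to the external C# process
  rows.map(send)       // stream each row through the live connection
}

// Stand-in usage: "connect" once, then send three rows.
val out = processPartition[Int, Int] { () =>
  println("connected")   // happens once per partition, not once per row
  x => x * 2
}(Iterator(1, 2, 3)).toList   // List(2, 4, 6)
{code}

This is exactly why a per-row Scala UDF cannot replace mapPartitions here: the setup cost would be paid on every row instead of once per partition.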
[jira] [Updated] (SPARK-33519) Batch UDF in scala
[ https://issues.apache.org/jira/browse/SPARK-33519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gaetan updated SPARK-33519: --- Description: Hello, Contrary to Python, there is only one type of Scala UDF, that let us define a Scala function to apply on a set of Column and which is called +for each row+. One advantage of Scala UDF over mapPartitions is that Catalyst is able to see what are the inputs which are then used for column pruning, predicate pushdown and other optimization rules. But in some use cases, there can be a setup phase that we only want to execute once per worker right before processing inputs. For such use cases, Scala UDF is not well suited and mapPartitions is used instead like this: {code:java} ds.mapPartitions( it => { setup() process(it) } ){code} After having looked at the code, I figured that Python UDF are implemented via query plans that retrieve a RDD via their children and that call mapPartitions of that RDD to work with batches of inputs. These query plans are generated by Catalyst by extracting Python UDFs (rule ExtractPythonUDFs). *Implementation details*: we could implement a new Expression ScalaBatchUDF and add a boolean isBatch to Expression that tells whether an Expression is batch or not. SparkPlan SelectExec, FilterExec (and probably more) would be modified to handle batch Expression: * Generated code would include code that call batch Expression with batch of inputs instead of one single input. * doExecute() method will call batch Expression with batch of inputs instead of one single input. A SparkPlan could be composed of "single" Expressions and batch expressions. It is a first idea that would need to be refined. What do you think ? Here is a very small description of *one of our use cases of Spark* that could greatly benefit from Scala batch UDFs: We are using Spark to distribute some computation run in C#. To do so, we call the method mapPartitions of the DataFrame that represents our data. 
Inside mapPartitions, we: * First connect to the C# process * Then iterate over the inputs by sending each input to the C# process and by getting back the results. The use of mapPartitions was motivated by the setup (connection to the C# process) that happens for each partition. Now that we have a first working version, we would like to improve it by limiting the columns to read. We don't want to select columns that are required by our computation right before the mapPartitions because it would result in filtering out columns that could be required by other transformations in the workflow. Instead, we would like to take advantage of Catalyst for column pruning, predict pushdowns and other optimization rules. Using a Scala UDF to replace the mapPartitions would not be efficient because we would connect to the C# process for each row. An alternative would be a Scala "batch" UDF which would be applied on the columns that are needed for our computation, to take advantage of Catalyst and its optimizing rules, and which input would be an iterator like mapPartitions. was: Hello, Contrary to Python, there is only one type of Scala UDF, that let us define a Scala function to apply on a set of Column and which is called +for each row+. One advantage of Scala UDF over mapPartitions is that Catalyst is able to see what are the inputs which are then used for column pruning, predicate pushdown and other optimization rules. But in some use cases, there can be a setup phase that we only want to execute once per worker right before processing inputs. For such use cases, Scala UDF is not well suited and mapPartitions is used instead like this: {code:java} ds.mapPartitions( it => { setup() process(it) } ){code} After having looked at the code, I figured that Python UDF are implemented via query plans that retrieve a RDD via their children and that call mapPartitions of that RDD to work with batches of inputs. 
These query plans are generated by Catalyst by extracting Python UDFs (rule ExtractPythonUDFs). Like for Python UDFs, we could implement Scala batch UDFs with query plans to work with a batch of inputs instead of one input. What do you think ? Here is a very small description of one of our use cases of Spark that could greatly benefit from Scala batch UDFs: We are using Spark to distribute some computation run in C#. To do so, we call the method mapPartitions of the DataFrame that represents our data. Inside mapPartitions, we: * First connect to the C# process * Then iterate over the inputs by sending each input to the C# process and by getting back the results. The use of mapPartitions was motivated by the setup (connection to the C# process) that happens for each partition. Now that we have a first working version, we would like to improve it by limiting the columns to read. We don't want to select columns that are required
[jira] [Assigned] (SPARK-33430) Support namespaces in JDBC v2 Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-33430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33430: Assignee: Apache Spark > Support namespaces in JDBC v2 Table Catalog > --- > > Key: SPARK-33430 > URL: https://issues.apache.org/jira/browse/SPARK-33430 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > When I extend JDBCTableCatalogSuite by > org.apache.spark.sql.execution.command.v2.ShowTablesSuite, for instance: > {code:scala} > import org.apache.spark.sql.execution.command.v2.ShowTablesSuite > class JDBCTableCatalogSuite extends ShowTablesSuite { > override def version: String = "JDBC V2" > override def catalog: String = "h2" > ... > {code} > some tests from JDBCTableCatalogSuite fail with: > {code} > [info] - SHOW TABLES JDBC V2: show an existing table *** FAILED *** (2 > seconds, 502 milliseconds) > [info] org.apache.spark.sql.AnalysisException: Cannot use catalog h2: does > not support namespaces; > [info] at > org.apache.spark.sql.connector.catalog.CatalogV2Implicits$CatalogHelper.asNamespaceCatalog(CatalogV2Implicits.scala:83) > [info] at > org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:208) > [info] at > org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:34) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33430) Support namespaces in JDBC v2 Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-33430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237592#comment-17237592 ] Apache Spark commented on SPARK-33430: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/30473 > Support namespaces in JDBC v2 Table Catalog > --- > > Key: SPARK-33430 > URL: https://issues.apache.org/jira/browse/SPARK-33430 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > When I extend JDBCTableCatalogSuite by > org.apache.spark.sql.execution.command.v2.ShowTablesSuite, for instance: > {code:scala} > import org.apache.spark.sql.execution.command.v2.ShowTablesSuite > class JDBCTableCatalogSuite extends ShowTablesSuite { > override def version: String = "JDBC V2" > override def catalog: String = "h2" > ... > {code} > some tests from JDBCTableCatalogSuite fail with: > {code} > [info] - SHOW TABLES JDBC V2: show an existing table *** FAILED *** (2 > seconds, 502 milliseconds) > [info] org.apache.spark.sql.AnalysisException: Cannot use catalog h2: does > not support namespaces; > [info] at > org.apache.spark.sql.connector.catalog.CatalogV2Implicits$CatalogHelper.asNamespaceCatalog(CatalogV2Implicits.scala:83) > [info] at > org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:208) > [info] at > org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:34) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33430) Support namespaces in JDBC v2 Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-33430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33430: Assignee: (was: Apache Spark) > Support namespaces in JDBC v2 Table Catalog > --- > > Key: SPARK-33430 > URL: https://issues.apache.org/jira/browse/SPARK-33430 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > When I extend JDBCTableCatalogSuite by > org.apache.spark.sql.execution.command.v2.ShowTablesSuite, for instance: > {code:scala} > import org.apache.spark.sql.execution.command.v2.ShowTablesSuite > class JDBCTableCatalogSuite extends ShowTablesSuite { > override def version: String = "JDBC V2" > override def catalog: String = "h2" > ... > {code} > some tests from JDBCTableCatalogSuite fail with: > {code} > [info] - SHOW TABLES JDBC V2: show an existing table *** FAILED *** (2 > seconds, 502 milliseconds) > [info] org.apache.spark.sql.AnalysisException: Cannot use catalog h2: does > not support namespaces; > [info] at > org.apache.spark.sql.connector.catalog.CatalogV2Implicits$CatalogHelper.asNamespaceCatalog(CatalogV2Implicits.scala:83) > [info] at > org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:208) > [info] at > org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:34) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
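For context on the failure quoted above: the `Cannot use catalog h2: does not support namespaces` error is raised by `CatalogV2Implicits.asNamespaceCatalog`, which only succeeds when the catalog plugin mixes in `SupportsNamespaces`. Roughly, as a paraphrase for illustration (not the exact Spark source):

{code:scala}
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.connector.catalog.{CatalogPlugin, SupportsNamespaces}

// Paraphrase of the check behind the error above: the JDBC v2 catalog
// must implement SupportsNamespaces for this cast to succeed.
def asNamespaceCatalog(plugin: CatalogPlugin): SupportsNamespaces = plugin match {
  case ns: SupportsNamespaces => ns
  case _ =>
    throw new AnalysisException(
      s"Cannot use catalog ${plugin.name}: does not support namespaces")
}
{code}

So "support namespaces" here means having JDBCTableCatalog implement that mixin, mapping catalog namespaces onto JDBC schemas.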
[jira] [Updated] (SPARK-33519) Batch UDF in scala
[ https://issues.apache.org/jira/browse/SPARK-33519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gaetan updated SPARK-33519: --- Description: Hello, Contrary to Python, there is only one type of Scala UDF, that let us define a Scala function to apply on a set of Column and which is called +for each row+. One advantage of Scala UDF over mapPartitions is that Catalyst is able to see what are the inputs which are then used for column pruning, predicate pushdown and other optimization rules. But in some use cases, there can be a setup phase that we only want to execute once per worker right before processing inputs. For such use cases, Scala UDF is not well suited and mapPartitions is used instead like this: {code:java} ds.mapPartitions( it => { setup() process(it) } ){code} After having looked at the code, I figured that Python UDF are implemented via query plans that retrieve a RDD via their children and that call mapPartitions of that RDD to work with batches of inputs. These query plans are generated by Catalyst by extracting Python UDFs (rule ExtractPythonUDFs). - *Implementation details*: 1. we could implement a new Expression ScalaBatchUDF and add a boolean isBatch to Expression that tells whether an Expression is batch or not. SparkPlan SelectExec, FilterExec (and probably more) would be modified to handle batch Expression: * Generated code would include code that call batch Expression with batch of inputs instead of one single input. * doExecute() method will call batch Expression with batch of inputs instead of one single input. A SparkPlan could be composed of "single" Expressions and batch expressions. It is a first idea that would need to be refined. 2. Another solution could also be to do as for Python UDFs: a batch UDF, implemented as Expression, is extracted from the query plan it belongs to and transformed into a query plan ScalaBatchUDFExec (which become child of the query plan that the batch UDF belongs to). What do you think ? 
- Here is a very small description of *one of our use cases of Spark* that could greatly benefit from Scala batch UDFs: We are using Spark to distribute some computation run in C#. To do so, we call the method mapPartitions of the DataFrame that represents our data. Inside mapPartitions, we: * First connect to the C# process * Then iterate over the inputs by sending each input to the C# process and by getting back the results. The use of mapPartitions was motivated by the setup (connection to the C# process) that happens for each partition. Now that we have a first working version, we would like to improve it by limiting the columns to read. We don't want to select columns that are required by our computation right before the mapPartitions because it would result in filtering out columns that could be required by other transformations in the workflow. Instead, we would like to take advantage of Catalyst for column pruning, predict pushdowns and other optimization rules. Using a Scala UDF to replace the mapPartitions would not be efficient because we would connect to the C# process for each row. An alternative would be a Scala "batch" UDF which would be applied on the columns that are needed for our computation, to take advantage of Catalyst and its optimizing rules, and which input would be an iterator like mapPartitions. was: Hello, Contrary to Python, there is only one type of Scala UDF, that let us define a Scala function to apply on a set of Column and which is called +for each row+. One advantage of Scala UDF over mapPartitions is that Catalyst is able to see what are the inputs which are then used for column pruning, predicate pushdown and other optimization rules. But in some use cases, there can be a setup phase that we only want to execute once per worker right before processing inputs. 
For such use cases, Scala UDF is not well suited and mapPartitions is used instead like this: {code:java} ds.mapPartitions( it => { setup() process(it) } ){code} After having looked at the code, I figured that Python UDF are implemented via query plans that retrieve a RDD via their children and that call mapPartitions of that RDD to work with batches of inputs. These query plans are generated by Catalyst by extracting Python UDFs (rule ExtractPythonUDFs). *Implementation details*: we could implement a new Expression ScalaBatchUDF and add a boolean isBatch to Expression that tells whether an Expression is batch or not. SparkPlan SelectExec, FilterExec (and probably more) would be modified to handle batch Expression: * Generated code would include code that call batch Expression with batch of inputs instead of one single input. * doExecute() method will call batch Expression with batch of inputs instead of one single input. A SparkPlan could be composed of "single" Expressions and batch expre
[jira] [Created] (SPARK-33521) Universal type conversion of V2 partition values
Maxim Gekk created SPARK-33521: -- Summary: Universal type conversion of V2 partition values Key: SPARK-33521 URL: https://issues.apache.org/jira/browse/SPARK-33521 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Support other types while resolving partition specs in https://github.com/apache/spark/blob/23e9920b3910e4f05269853429c7f1cdc7b5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolvePartitionSpec.scala#L72 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33521) Universal type conversion of V2 partition values
[ https://issues.apache.org/jira/browse/SPARK-33521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33521: Assignee: (was: Apache Spark) > Universal type conversion of V2 partition values > > > Key: SPARK-33521 > URL: https://issues.apache.org/jira/browse/SPARK-33521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > Support other types while resolving partition specs in > https://github.com/apache/spark/blob/23e9920b3910e4f05269853429c7f1cdc7b5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolvePartitionSpec.scala#L72 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33521) Universal type conversion of V2 partition values
[ https://issues.apache.org/jira/browse/SPARK-33521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33521: Assignee: Apache Spark > Universal type conversion of V2 partition values > > > Key: SPARK-33521 > URL: https://issues.apache.org/jira/browse/SPARK-33521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > Support other types while resolving partition specs in > https://github.com/apache/spark/blob/23e9920b3910e4f05269853429c7f1cdc7b5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolvePartitionSpec.scala#L72 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33521) Universal type conversion of V2 partition values
[ https://issues.apache.org/jira/browse/SPARK-33521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237677#comment-17237677 ] Apache Spark commented on SPARK-33521: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30474 > Universal type conversion of V2 partition values > > > Key: SPARK-33521 > URL: https://issues.apache.org/jira/browse/SPARK-33521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > Support other types while resolving partition specs in > https://github.com/apache/spark/blob/23e9920b3910e4f05269853429c7f1cdc7b5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolvePartitionSpec.scala#L72 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33521) Universal type conversion of V2 partition values
[ https://issues.apache.org/jira/browse/SPARK-33521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237678#comment-17237678 ] Apache Spark commented on SPARK-33521: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30474 > Universal type conversion of V2 partition values > > > Key: SPARK-33521 > URL: https://issues.apache.org/jira/browse/SPARK-33521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > Support other types while resolving partition specs in > https://github.com/apache/spark/blob/23e9920b3910e4f05269853429c7f1cdc7b5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolvePartitionSpec.scala#L72 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
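Partition spec values arrive from SQL as strings, so "support other types" amounts to converting each raw value to the partition column's declared type. A minimal sketch of such a conversion using Catalyst's Cast expression (a hypothetical helper for illustration; the actual change is in the file and pull request linked above):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.{DataType, IntegerType}

// Hypothetical helper: evaluate a Cast of the raw partition-spec string
// to the Catalyst value of the partition column's type.
def convertPartitionValue(raw: String, dt: DataType): Any =
  Cast(Literal(raw), dt).eval()

// e.g. convertPartitionValue("15", IntegerType) would yield the Int 15.
{code}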
[jira] [Resolved] (SPARK-32918) RPC implementation to support control plane coordination for push-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-32918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-32918. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30163 [https://github.com/apache/spark/pull/30163] > RPC implementation to support control plane coordination for push-based > shuffle > --- > > Key: SPARK-32918 > URL: https://issues.apache.org/jira/browse/SPARK-32918 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Priority: Major > Fix For: 3.1.0 > > > RPCs to facilitate coordination of shuffle map/reduce stages. Notifications > to external shuffle services to finalize shuffle block merge for a given > shuffle are carried through this RPC. It also responds to the caller with the > metadata about the merged shuffle partition. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32918) RPC implementation to support control plane coordination for push-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-32918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-32918: --- Assignee: Ye Zhou > RPC implementation to support control plane coordination for push-based > shuffle > --- > > Key: SPARK-32918 > URL: https://issues.apache.org/jira/browse/SPARK-32918 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Assignee: Ye Zhou >Priority: Major > Fix For: 3.1.0 > > > RPCs to facilitate coordination of shuffle map/reduce stages. Notifications > to external shuffle services to finalize shuffle block merge for a given > shuffle are carried through this RPC. It also respond back the metadata about > a merged shuffle partition back to the caller. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33522) Improve exception messages while handling UnresolvedTableOrView
Terry Kim created SPARK-33522: - Summary: Improve exception messages while handling UnresolvedTableOrView Key: SPARK-33522 URL: https://issues.apache.org/jira/browse/SPARK-33522 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Terry Kim Improve exception messages while handling UnresolvedTableOrView. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33522) Improve exception messages while handling UnresolvedTableOrView
[ https://issues.apache.org/jira/browse/SPARK-33522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237709#comment-17237709 ] Apache Spark commented on SPARK-33522: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/30475 > Improve exception messages while handling UnresolvedTableOrView > --- > > Key: SPARK-33522 > URL: https://issues.apache.org/jira/browse/SPARK-33522 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > Improve exception messages while handling UnresolvedTableOrView. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33522) Improve exception messages while handling UnresolvedTableOrView
[ https://issues.apache.org/jira/browse/SPARK-33522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33522: Assignee: Apache Spark > Improve exception messages while handling UnresolvedTableOrView > --- > > Key: SPARK-33522 > URL: https://issues.apache.org/jira/browse/SPARK-33522 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Apache Spark >Priority: Minor > > Improve exception messages while handling UnresolvedTableOrView. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33522) Improve exception messages while handling UnresolvedTableOrView
[ https://issues.apache.org/jira/browse/SPARK-33522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33522: Assignee: (was: Apache Spark) > Improve exception messages while handling UnresolvedTableOrView > --- > > Key: SPARK-33522 > URL: https://issues.apache.org/jira/browse/SPARK-33522 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > Improve exception messages while handling UnresolvedTableOrView. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33523) Add predicate related benchmark to SubExprEliminationBenchmark
L. C. Hsieh created SPARK-33523: --- Summary: Add predicate related benchmark to SubExprEliminationBenchmark Key: SPARK-33523 URL: https://issues.apache.org/jira/browse/SPARK-33523 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.1.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh This is for the task to add predicate related benchmark to SubExprEliminationBenchmark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32808) Pass all `sql/core` module UTs in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237740#comment-17237740 ] Dongjoon Hyun commented on SPARK-32808: --- `sql/core` module seems to be broken. I will file a new Jira. {code:java} $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13 ... [info] - SPARK-31255: Project a metadata column *** FAILED *** (96 milliseconds) [info] - SPARK-31255: Projects data column when metadata column has the same name *** FAILED *** (77 milliseconds){code} > Pass all `sql/core` module UTs in Scala 2.13 > > > Key: SPARK-32808 > URL: https://issues.apache.org/jira/browse/SPARK-32808 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.1.0 > > > Now there are 319 TESTS FAILED based on commit > `f5360e761ef161f7e04526b59a4baf53f1cf8cd5` > {code:java} > Run completed in 1 hour, 20 minutes, 25 seconds. > Total number of tests run: 8485 > Suites: completed 357, aborted 0 > Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0 > *** 319 TESTS FAILED *** > {code} > > There are 293 failures associated with TPCDS_XXX_PlanStabilitySuite and > TPCDS_XXX_PlanStabilityWithStatsSuite: > * TPCDSV2_7_PlanStabilitySuite(33 FAILED) > * TPCDSV1_4_PlanStabilityWithStatsSuite(94 FAILED) > * TPCDSModifiedPlanStabilityWithStatsSuite(21 FAILED) > * TPCDSV1_4_PlanStabilitySuite(92 FAILED) > * TPCDSModifiedPlanStabilitySuite(21 FAILED) > * TPCDSV2_7_PlanStabilityWithStatsSuite(32 FAILED) > > The other 26 FAILED cases are as follows: > * StreamingAggregationSuite > ** count distinct - state format version 1 > ** count distinct - state format version 2 > * GeneratorFunctionSuite > ** explode and other columns > ** explode_outer and other columns > * UDFSuite > ** SPARK-26308: udf with complex types of decimal > ** SPARK-32459: UDF should not fail on WrappedArray > * SQLQueryTestSuite > ** decimalArithmeticOperations.sql > ** postgreSQL/aggregates_part2.sql > ** ansi/decimalArithmeticOperations.sql > ** udf/postgreSQL/udf-aggregates_part2.sql - Scala UDF > ** udf/postgreSQL/udf-aggregates_part2.sql - Regular Python UDF > * WholeStageCodegenSuite > ** SPARK-26680: Stream in groupBy does not cause StackOverflowError > * DataFrameSuite: > ** explode > ** SPARK-28067: Aggregate sum should not return wrong results for decimal > overflow > ** Star Expansion - ds.explode should fail with a meaningful message if it > takes a star > * DataStreamReaderWriterSuite > ** SPARK-18510: use user specified types for partition columns in file > sources > * OrcV1QuerySuite\OrcV2QuerySuite > ** Simple selection form ORC table * 2 > * ExpressionsSchemaSuite > ** Check schemas for expression examples > * DataFrameStatSuite > ** SPARK-28818: Respect original column nullability in `freqItems` > * JsonV1Suite\JsonV2Suite\JsonLegacyTimeParserSuite > ** SPARK-4228 DataFrame to JSON * 3 > ** backward compatibility * 3 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33524) Fix DataSourceV2SQLSuite in Scala 2.13
Dongjoon Hyun created SPARK-33524: - Summary: Fix DataSourceV2SQLSuite in Scala 2.13 Key: SPARK-33524 URL: https://issues.apache.org/jira/browse/SPARK-33524 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Dongjoon Hyun `sql/core` module seems to be broken. I will file a new Jira. {code:java} $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13 ... [info] - SPARK-31255: Project a metadata column *** FAILED *** (96 milliseconds) [info] - SPARK-31255: Projects data column when metadata column has the same name *** FAILED *** (77 milliseconds){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32808) Pass all `sql/core` module UTs in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237741#comment-17237741 ] Dongjoon Hyun commented on SPARK-32808: --- I filed SPARK-33524. > Pass all `sql/core` module UTs in Scala 2.13 > > > Key: SPARK-32808 > URL: https://issues.apache.org/jira/browse/SPARK-32808 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.1.0 > > > Now there are 319 TESTS FAILED based on commit > `f5360e761ef161f7e04526b59a4baf53f1cf8cd5` > {code:java} > Run completed in 1 hour, 20 minutes, 25 seconds. > Total number of tests run: 8485 > Suites: completed 357, aborted 0 > Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0 > *** 319 TESTS FAILED *** > {code} > > There are 293 failures associated with TPCDS_XXX_PlanStabilitySuite and > TPCDS_XXX_PlanStabilityWithStatsSuite: > * TPCDSV2_7_PlanStabilitySuite(33 FAILED) > * TPCDSV1_4_PlanStabilityWithStatsSuite(94 FAILED) > * TPCDSModifiedPlanStabilityWithStatsSuite(21 FAILED) > * TPCDSV1_4_PlanStabilitySuite(92 FAILED) > * TPCDSModifiedPlanStabilitySuite(21 FAILED) > * TPCDSV2_7_PlanStabilityWithStatsSuite(32 FAILED) > > The other 26 FAILED cases are as follows: > * StreamingAggregationSuite > ** count distinct - state format version 1 > ** count distinct - state format version 2 > * GeneratorFunctionSuite > ** explode and other columns > ** explode_outer and other columns > * UDFSuite > ** SPARK-26308: udf with complex types of decimal > ** SPARK-32459: UDF should not fail on WrappedArray > * SQLQueryTestSuite > ** decimalArithmeticOperations.sql > ** postgreSQL/aggregates_part2.sql > ** ansi/decimalArithmeticOperations.sql > ** udf/postgreSQL/udf-aggregates_part2.sql - Scala UDF > ** udf/postgreSQL/udf-aggregates_part2.sql - Regular Python UDF > * WholeStageCodegenSuite > ** SPARK-26680: Stream in groupBy does not cause StackOverflowError
> * DataFrameSuite: > ** explode > ** SPARK-28067: Aggregate sum should not return wrong results for decimal > overflow > ** Star Expansion - ds.explode should fail with a meaningful message if it > takes a star > * DataStreamReaderWriterSuite > ** SPARK-18510: use user specified types for partition columns in file > sources > * OrcV1QuerySuite\OrcV2QuerySuite > ** Simple selection form ORC table * 2 > * ExpressionsSchemaSuite > ** Check schemas for expression examples > * DataFrameStatSuite > ** SPARK-28818: Respect original column nullability in `freqItems` > * JsonV1Suite\JsonV2Suite\JsonLegacyTimeParserSuite > ** SPARK-4228 DataFrame to JSON * 3 > ** backward compatibility * 3 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33523) Add predicate related benchmark to SubExprEliminationBenchmark
[ https://issues.apache.org/jira/browse/SPARK-33523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237759#comment-17237759 ] Apache Spark commented on SPARK-33523: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/30476 > Add predicate related benchmark to SubExprEliminationBenchmark > -- > > Key: SPARK-33523 > URL: https://issues.apache.org/jira/browse/SPARK-33523 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > This is for the task to add predicate related benchmark to > SubExprEliminationBenchmark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33523) Add predicate related benchmark to SubExprEliminationBenchmark
[ https://issues.apache.org/jira/browse/SPARK-33523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33523: Assignee: Apache Spark (was: L. C. Hsieh) > Add predicate related benchmark to SubExprEliminationBenchmark > -- > > Key: SPARK-33523 > URL: https://issues.apache.org/jira/browse/SPARK-33523 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > This is for the task to add predicate related benchmark to > SubExprEliminationBenchmark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33523) Add predicate related benchmark to SubExprEliminationBenchmark
[ https://issues.apache.org/jira/browse/SPARK-33523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33523: Assignee: L. C. Hsieh (was: Apache Spark) > Add predicate related benchmark to SubExprEliminationBenchmark > -- > > Key: SPARK-33523 > URL: https://issues.apache.org/jira/browse/SPARK-33523 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > This is for the task to add predicate related benchmark to > SubExprEliminationBenchmark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33524) Fix DataSourceV2SQLSuite in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-33524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33524: Assignee: Apache Spark > Fix DataSourceV2SQLSuite in Scala 2.13 > -- > > Key: SPARK-33524 > URL: https://issues.apache.org/jira/browse/SPARK-33524 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > `sql/core` module seems to be broken. I will file a new Jira. > {code:java} > $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13 > ... > [info] - SPARK-31255: Project a metadata column *** FAILED *** (96 > milliseconds) > [info] - SPARK-31255: Projects data column when metadata column has the same > name *** FAILED *** (77 milliseconds){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33524) Fix DataSourceV2SQLSuite in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-33524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237761#comment-17237761 ] Apache Spark commented on SPARK-33524: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/30477 > Fix DataSourceV2SQLSuite in Scala 2.13 > -- > > Key: SPARK-33524 > URL: https://issues.apache.org/jira/browse/SPARK-33524 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > > `sql/core` module seems to be broken. I will file a new Jira. > {code:java} > $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13 > ... > [info] - SPARK-31255: Project a metadata column *** FAILED *** (96 > milliseconds) > [info] - SPARK-31255: Projects data column when metadata column has the same > name *** FAILED *** (77 milliseconds){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33524) Fix DataSourceV2SQLSuite in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-33524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33524: Assignee: (was: Apache Spark) > Fix DataSourceV2SQLSuite in Scala 2.13 > -- > > Key: SPARK-33524 > URL: https://issues.apache.org/jira/browse/SPARK-33524 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > > `sql/core` module seems to be broken. I will file a new Jira. > {code:java} > $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13 > ... > [info] - SPARK-31255: Project a metadata column *** FAILED *** (96 > milliseconds) > [info] - SPARK-31255: Projects data column when metadata column has the same > name *** FAILED *** (77 milliseconds){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33513) Upgrade to Scala 2.13.4
[ https://issues.apache.org/jira/browse/SPARK-33513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33513. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30455 [https://github.com/apache/spark/pull/30455] > Upgrade to Scala 2.13.4 > --- > > Key: SPARK-33513 > URL: https://issues.apache.org/jira/browse/SPARK-33513 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33513) Upgrade to Scala 2.13.4
[ https://issues.apache.org/jira/browse/SPARK-33513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33513: - Assignee: Dongjoon Hyun > Upgrade to Scala 2.13.4 > --- > > Key: SPARK-33513 > URL: https://issues.apache.org/jira/browse/SPARK-33513 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33524) Change `BucketTransform` not to use Tuple.hashCode
[ https://issues.apache.org/jira/browse/SPARK-33524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33524: -- Summary: Change `BucketTransform` not to use Tuple.hashCode (was: Fix DataSourceV2SQLSuite in Scala 2.13) > Change `BucketTransform` not to use Tuple.hashCode > -- > > Key: SPARK-33524 > URL: https://issues.apache.org/jira/browse/SPARK-33524 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > > `sql/core` module seems to be broken. I will file a new Jira. > {code:java} > $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13 > ... > [info] - SPARK-31255: Project a metadata column *** FAILED *** (96 > milliseconds) > [info] - SPARK-31255: Projects data column when metadata column has the same > name *** FAILED *** (77 milliseconds){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33524) Change `BucketTransform` not to use Tuple.hashCode
[ https://issues.apache.org/jira/browse/SPARK-33524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33524: -- Affects Version/s: 3.0.1 > Change `BucketTransform` not to use Tuple.hashCode > -- > > Key: SPARK-33524 > URL: https://issues.apache.org/jira/browse/SPARK-33524 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > > `sql/core` module seems to be broken. I will file a new Jira. > {code:java} > $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13 > ... > [info] - SPARK-31255: Project a metadata column *** FAILED *** (96 > milliseconds) > [info] - SPARK-31255: Projects data column when metadata column has the same > name *** FAILED *** (77 milliseconds){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33524) Change `BucketTransform` not to use Tuple.hashCode
[ https://issues.apache.org/jira/browse/SPARK-33524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33524: -- Component/s: Tests > Change `BucketTransform` not to use Tuple.hashCode > -- > > Key: SPARK-33524 > URL: https://issues.apache.org/jira/browse/SPARK-33524 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.1, 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > > `sql/core` module seems to be broken. I will file a new Jira. > {code:java} > $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13 > ... > [info] - SPARK-31255: Project a metadata column *** FAILED *** (96 > milliseconds) > [info] - SPARK-31255: Projects data column when metadata column has the same > name *** FAILED *** (77 milliseconds){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33513) Upgrade to Scala 2.13.4
[ https://issues.apache.org/jira/browse/SPARK-33513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33513: -- Parent: SPARK-25075 Issue Type: Sub-task (was: Improvement) > Upgrade to Scala 2.13.4 > --- > > Key: SPARK-33513 > URL: https://issues.apache.org/jira/browse/SPARK-33513 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-33516) Upgrade Scala 2.13 from 2.13.3 to 2.13.4
[ https://issues.apache.org/jira/browse/SPARK-33516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-33516. - > Upgrade Scala 2.13 from 2.13.3 to 2.13.4 > > > Key: SPARK-33516 > URL: https://issues.apache.org/jira/browse/SPARK-33516 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.1.0 >Reporter: Yang Jie >Priority: Major > > Scala 2.13.4 released(https://github.com/scala/scala/releases/tag/v2.13.4) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33516) Upgrade Scala 2.13 from 2.13.3 to 2.13.4
[ https://issues.apache.org/jira/browse/SPARK-33516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33516. --- Resolution: Duplicate > Upgrade Scala 2.13 from 2.13.3 to 2.13.4 > > > Key: SPARK-33516 > URL: https://issues.apache.org/jira/browse/SPARK-33516 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.1.0 >Reporter: Yang Jie >Priority: Major > > Scala 2.13.4 released(https://github.com/scala/scala/releases/tag/v2.13.4) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33524) Change `InMemoryTable` not to use Tuple.hashCode for `BucketTransform`
[ https://issues.apache.org/jira/browse/SPARK-33524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33524: -- Summary: Change `InMemoryTable` not to use Tuple.hashCode for `BucketTransform` (was: Change `BucketTransform` not to use Tuple.hashCode) > Change `InMemoryTable` not to use Tuple.hashCode for `BucketTransform` > -- > > Key: SPARK-33524 > URL: https://issues.apache.org/jira/browse/SPARK-33524 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.1, 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > > `sql/core` module seems to be broken. I will file a new Jira. > {code:java} > $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13 > ... > [info] - SPARK-31255: Project a metadata column *** FAILED *** (96 > milliseconds) > [info] - SPARK-31255: Projects data column when metadata column has the same > name *** FAILED *** (77 milliseconds){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
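The retitled issue above concerns `InMemoryTable` deriving bucket assignments from `Tuple.hashCode`, whose value can change across Scala versions and can be negative. The sketch below is an assumed illustration of the underlying pitfall, not Spark's actual fix: taking a raw `%` of a negative hash yields a negative bucket index, whereas `Math.floorMod` always stays in range.

```java
final class Bucketing {
    // Fragile variant: key.hashCode() can be negative, so `%` may return a
    // negative "bucket" that indexes nothing.
    static int bucketWithRemainder(Object key, int numBuckets) {
        return key.hashCode() % numBuckets;
    }

    // Safer variant: floorMod always lands in [0, numBuckets).
    static int bucketWithFloorMod(Object key, int numBuckets) {
        return Math.floorMod(key.hashCode(), numBuckets);
    }
}
```

For example, `Integer.valueOf(-7)` hashes to -7, so the remainder variant produces -3 for four buckets while the floorMod variant produces 1. Avoiding tuple hashing altogether (e.g. hashing the individual column values in a version-stable way) additionally removes the dependence on how a particular Scala release computes `Tuple.hashCode`.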
[jira] [Created] (SPARK-33525) Upgrade hive-service-rpc to 3.1.2
Yuming Wang created SPARK-33525: --- Summary: Upgrade hive-service-rpc to 3.1.2 Key: SPARK-33525 URL: https://issues.apache.org/jira/browse/SPARK-33525 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Yuming Wang We support Hive metastore versions 0.12.0 through 3.1.2, but we only support hive-jdbc 0.12.0 through 2.3.7. It will throw a TProtocolException if we use hive-jdbc 3.x: {noformat} [root@spark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u jdbc:hive2://localhost:1/default Connecting to jdbc:hive2://localhost:1/default Connected to: Spark SQL (version 3.1.0-SNAPSHOT) Driver: Hive JDBC (version 3.1.2) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline version 3.1.2 by Apache Hive 0: jdbc:hive2://localhost:1/default> create table t1(id int) using parquet; Unexpected end of file when reading from HS2 server. The root cause might be too many concurrent connections. Please ask the administrator to check the number of active connections, and adjust hive.server2.thrift.max.worker.threads if applicable. Error: org.apache.thrift.transport.TTransportException (state=08S01,code=0) {noformat} {noformat} org.apache.thrift.protocol.TProtocolException: Missing version in readMessageBegin, old client? at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:234) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) at java.base/java.lang.Thread.run(Thread.java:832) {noformat} We can upgrade hive-service-rpc to 3.1.2 to fix this issue. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33525) Upgrade hive-service-rpc to 3.1.2
[ https://issues.apache.org/jira/browse/SPARK-33525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237815#comment-17237815 ] Yuming Wang commented on SPARK-33525: - We should handle CLI_ODBC_KEYWORDS in SparkSQLCLIService to workaround this issue: {noformat} 20/11/23 20:03:09 WARN ThriftCLIService: Error getting info: org.apache.hive.service.cli.HiveSQLException: Unrecognized GetInfoType value: CLI_ODBC_KEYWORDS at org.apache.hive.service.cli.session.HiveSessionImpl.getInfo(HiveSessionImpl.java:444) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:564) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63) at java.base/java.security.AccessController.doPrivileged(AccessController.java:691) at java.base/javax.security.auth.Subject.doAs(Subject.java:425) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59) at com.sun.proxy.$Proxy23.getInfo(Unknown Source) at org.apache.hive.service.cli.CLIService.getInfo(CLIService.java:250) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIService.getInfo(SparkSQLCLIService.scala:107) at org.apache.hive.service.cli.thrift.ThriftCLIService.GetInfo(ThriftCLIService.java:440) at org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetInfo.getResult(TCLIService.java:1537) at 
org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetInfo.getResult(TCLIService.java:1522) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) at java.base/java.lang.Thread.run(Thread.java:832) {noformat} > Upgrade hive-service-rpc to 3.1.2 > - > > Key: SPARK-33525 > URL: https://issues.apache.org/jira/browse/SPARK-33525 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > We supported Hive metastore are 0.12.0 through 3.1.2. but we supported > hive-jdbc are 0.12.0 through 2.3.7. It will throw TProtocolException if we > use hive-jdbc 3.x: > {noformat} > [root@spark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u > jdbc:hive2://localhost:1/default > Connecting to jdbc:hive2://localhost:1/default > Connected to: Spark SQL (version 3.1.0-SNAPSHOT) > Driver: Hive JDBC (version 3.1.2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 3.1.2 by Apache Hive > 0: jdbc:hive2://localhost:1/default> create table t1(id int) using > parquet; > Unexpected end of file when reading from HS2 server. The root cause might be > too many concurrent connections. Please ask the administrator to check the > number of active connections, and adjust > hive.server2.thrift.max.worker.threads if applicable. > Error: org.apache.thrift.transport.TTransportException (state=08S01,code=0) > {noformat} > {noformat} > org.apache.thrift.protocol.TProtocolException: Missing version in > readMessageBegin, old client? 
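The stack trace above shows `HiveSessionImpl.getInfo` rejecting `CLI_ODBC_KEYWORDS`, a `GetInfoType` value that hive-jdbc 3.x clients send but the bundled older session code does not recognize. The sketch below illustrates the shape of the suggested workaround using hypothetical stand-in types (it is not Spark's actual `SparkSQLCLIService` code): answer the newer info type up front instead of delegating to the legacy handler that throws.

```java
// Hypothetical stand-ins for Hive's GetInfoType and session classes.
enum GetInfoType { CLI_DBMS_NAME, CLI_DBMS_VER, CLI_ODBC_KEYWORDS }

class LegacySession {
    // Models a session implementation that predates CLI_ODBC_KEYWORDS.
    String getInfo(GetInfoType type) {
        switch (type) {
            case CLI_DBMS_NAME: return "Spark SQL";
            case CLI_DBMS_VER:  return "3.1.0";
            default: throw new IllegalArgumentException(
                "Unrecognized GetInfoType value: " + type);
        }
    }
}

class PatchedCliService {
    private final LegacySession delegate = new LegacySession();

    // Workaround: intercept the newer info type before the legacy code throws.
    String getInfo(GetInfoType type) {
        if (type == GetInfoType.CLI_ODBC_KEYWORDS) {
            return ""; // report no dialect-specific ODBC keywords
        }
        return delegate.getInfo(type);
    }
}
```

With this shape, older info types still take the existing code path, and only the value introduced by the newer Thrift contract is special-cased.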
> at > org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:234) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) > at java.base/java.lang.Thread.run(Thread.java:832) > {noformat} > We can upgrade hive-service-rpc to 3.1.2 to fix this issue.
[jira] [Commented] (SPARK-33525) Upgrade hive-service-rpc to 3.1.2
[ https://issues.apache.org/jira/browse/SPARK-33525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237822#comment-17237822 ] Apache Spark commented on SPARK-33525: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/30478 > Upgrade hive-service-rpc to 3.1.2 > - > > Key: SPARK-33525 > URL: https://issues.apache.org/jira/browse/SPARK-33525 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > We supported Hive metastore are 0.12.0 through 3.1.2. but we supported > hive-jdbc are 0.12.0 through 2.3.7. It will throw TProtocolException if we > use hive-jdbc 3.x: > {noformat} > [root@spark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u > jdbc:hive2://localhost:1/default > Connecting to jdbc:hive2://localhost:1/default > Connected to: Spark SQL (version 3.1.0-SNAPSHOT) > Driver: Hive JDBC (version 3.1.2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 3.1.2 by Apache Hive > 0: jdbc:hive2://localhost:1/default> create table t1(id int) using > parquet; > Unexpected end of file when reading from HS2 server. The root cause might be > too many concurrent connections. Please ask the administrator to check the > number of active connections, and adjust > hive.server2.thrift.max.worker.threads if applicable. > Error: org.apache.thrift.transport.TTransportException (state=08S01,code=0) > {noformat} > {noformat} > org.apache.thrift.protocol.TProtocolException: Missing version in > readMessageBegin, old client? 
> at > org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:234) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) > at java.base/java.lang.Thread.run(Thread.java:832) > {noformat} > We can upgrade hive-service-rpc to 3.1.2 to fix this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33525) Upgrade hive-service-rpc to 3.1.2
[ https://issues.apache.org/jira/browse/SPARK-33525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33525: Assignee: (was: Apache Spark) > Upgrade hive-service-rpc to 3.1.2 > - > > Key: SPARK-33525 > URL: https://issues.apache.org/jira/browse/SPARK-33525 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > We support Hive metastore versions 0.12.0 through 3.1.2, but we only support > hive-jdbc versions 0.12.0 through 2.3.7. It will throw a TProtocolException if we > use hive-jdbc 3.x: > {noformat} > [root@spark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u > jdbc:hive2://localhost:1/default > Connecting to jdbc:hive2://localhost:1/default > Connected to: Spark SQL (version 3.1.0-SNAPSHOT) > Driver: Hive JDBC (version 3.1.2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 3.1.2 by Apache Hive > 0: jdbc:hive2://localhost:1/default> create table t1(id int) using > parquet; > Unexpected end of file when reading from HS2 server. The root cause might be > too many concurrent connections. Please ask the administrator to check the > number of active connections, and adjust > hive.server2.thrift.max.worker.threads if applicable. > Error: org.apache.thrift.transport.TTransportException (state=08S01,code=0) > {noformat} > {noformat} > org.apache.thrift.protocol.TProtocolException: Missing version in > readMessageBegin, old client? 
> at > org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:234) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) > at java.base/java.lang.Thread.run(Thread.java:832) > {noformat} > We can upgrade hive-service-rpc to 3.1.2 to fix this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33525) Upgrade hive-service-rpc to 3.1.2
[ https://issues.apache.org/jira/browse/SPARK-33525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33525: Assignee: Apache Spark > Upgrade hive-service-rpc to 3.1.2 > - > > Key: SPARK-33525 > URL: https://issues.apache.org/jira/browse/SPARK-33525 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > We support Hive metastore versions 0.12.0 through 3.1.2, but we only support > hive-jdbc versions 0.12.0 through 2.3.7. It will throw a TProtocolException if we > use hive-jdbc 3.x: > {noformat} > [root@spark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u > jdbc:hive2://localhost:1/default > Connecting to jdbc:hive2://localhost:1/default > Connected to: Spark SQL (version 3.1.0-SNAPSHOT) > Driver: Hive JDBC (version 3.1.2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 3.1.2 by Apache Hive > 0: jdbc:hive2://localhost:1/default> create table t1(id int) using > parquet; > Unexpected end of file when reading from HS2 server. The root cause might be > too many concurrent connections. Please ask the administrator to check the > number of active connections, and adjust > hive.server2.thrift.max.worker.threads if applicable. > Error: org.apache.thrift.transport.TTransportException (state=08S01,code=0) > {noformat} > {noformat} > org.apache.thrift.protocol.TProtocolException: Missing version in > readMessageBegin, old client? 
> at > org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:234) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) > at java.base/java.lang.Thread.run(Thread.java:832) > {noformat} > We can upgrade hive-service-rpc to 3.1.2 to fix this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33525) Upgrade hive-service-rpc to 3.1.2
[ https://issues.apache.org/jira/browse/SPARK-33525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237823#comment-17237823 ] Apache Spark commented on SPARK-33525: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/30478 > Upgrade hive-service-rpc to 3.1.2 > - > > Key: SPARK-33525 > URL: https://issues.apache.org/jira/browse/SPARK-33525 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > We support Hive metastore versions 0.12.0 through 3.1.2, but we only support > hive-jdbc versions 0.12.0 through 2.3.7. It will throw a TProtocolException if we > use hive-jdbc 3.x: > {noformat} > [root@spark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u > jdbc:hive2://localhost:1/default > Connecting to jdbc:hive2://localhost:1/default > Connected to: Spark SQL (version 3.1.0-SNAPSHOT) > Driver: Hive JDBC (version 3.1.2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 3.1.2 by Apache Hive > 0: jdbc:hive2://localhost:1/default> create table t1(id int) using > parquet; > Unexpected end of file when reading from HS2 server. The root cause might be > too many concurrent connections. Please ask the administrator to check the > number of active connections, and adjust > hive.server2.thrift.max.worker.threads if applicable. > Error: org.apache.thrift.transport.TTransportException (state=08S01,code=0) > {noformat} > {noformat} > org.apache.thrift.protocol.TProtocolException: Missing version in > readMessageBegin, old client? 
> at > org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:234) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) > at java.base/java.lang.Thread.run(Thread.java:832) > {noformat} > We can upgrade hive-service-rpc to 3.1.2 to fix this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
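The "Missing version in readMessageBegin, old client?" failure in the stack traces above comes from TBinaryProtocol's strict read: a versioned message starts with a 32-bit word whose high bits carry a protocol version marker, while an old-style message starts with a plain positive name length, and a peer built against a mismatched hive-service-rpc version trips that check. A minimal Python sketch of the check, modeled loosely on Thrift's TBinaryProtocol (illustrative only, not the hive-service-rpc code):

```python
import struct

# Sketch of TBinaryProtocol's strict message-begin check (illustrative,
# not the actual hive-service-rpc code). A versioned message begins with
# a 32-bit word whose high 16 bits carry VERSION_1; an old-style message
# begins with a plain positive name length instead. A strict reader
# rejects the latter with the error quoted in this ticket.
VERSION_1 = -2147418112    # 0x80010000 as a signed 32-bit integer
VERSION_MASK = -65536      # 0xffff0000 as a signed 32-bit integer

def read_message_begin(header: bytes, strict_read: bool = True) -> str:
    (first,) = struct.unpack(">i", header[:4])
    if first < 0:                       # high bit set: versioned header
        if first & VERSION_MASK != VERSION_1:
            raise ValueError("Bad version in readMessageBegin")
        return "versioned"
    if strict_read:                     # the server-side failure above
        raise ValueError("Missing version in readMessageBegin, old client?")
    return "old-style"
```

Upgrading hive-service-rpc on both sides, as the PR does, makes the two ends agree on which framing to expect.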
[jira] [Assigned] (SPARK-33524) Change `InMemoryTable` not to use Tuple.hashCode for `BucketTransform`
[ https://issues.apache.org/jira/browse/SPARK-33524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33524: - Assignee: Dongjoon Hyun > Change `InMemoryTable` not to use Tuple.hashCode for `BucketTransform` > -- > > Key: SPARK-33524 > URL: https://issues.apache.org/jira/browse/SPARK-33524 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.1, 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > `sql/core` module seems to be broken. I will file a new Jira. > {code:java} > $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13 > ... > [info] - SPARK-31255: Project a metadata column *** FAILED *** (96 > milliseconds) > [info] - SPARK-31255: Projects data column when metadata column has the same > name *** FAILED *** (77 milliseconds){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33524) Change `InMemoryTable` not to use Tuple.hashCode for `BucketTransform`
[ https://issues.apache.org/jira/browse/SPARK-33524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33524. --- Fix Version/s: 3.0.2 3.1.0 Resolution: Fixed Issue resolved by pull request 30477 [https://github.com/apache/spark/pull/30477] > Change `InMemoryTable` not to use Tuple.hashCode for `BucketTransform` > -- > > Key: SPARK-33524 > URL: https://issues.apache.org/jira/browse/SPARK-33524 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.1, 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0, 3.0.2 > > > `sql/core` module seems to be broken. I will file a new Jira. > {code:java} > $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13 > ... > [info] - SPARK-31255: Project a metadata column *** FAILED *** (96 > milliseconds) > [info] - SPARK-31255: Projects data column when metadata column has the same > name *** FAILED *** (77 milliseconds){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
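The pitfall behind this test fix generalizes: a runtime's default hash (Scala's Tuple.hashCode here, whose behavior differs between Scala 2.12 and 2.13) is not a stable contract, so bucket assignments derived from it are not reproducible across builds. A small Python analogy using a hash with a fixed specification; the function name and data are illustrative, not Spark's implementation:

```python
import zlib

# Bucket assignment should use a hash with a fixed, documented definition.
# CRC-32 yields the same value in every process and on every runtime
# version, unlike a runtime default hash (Python's str hash is randomized
# per process; Tuple.hashCode changed between Scala versions).
def stable_bucket(value: str, num_buckets: int) -> int:
    return zlib.crc32(value.encode("utf-8")) % num_buckets
```

With a specification-defined hash, the same row lands in the same bucket no matter which runtime version evaluates it.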
[jira] [Resolved] (SPARK-33508) Why the user cannot assign another `key.deserializer` and why it should always be `ByteArrayDeserializer`?
[ https://issues.apache.org/jira/browse/SPARK-33508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33508. -- Resolution: Invalid Let's ask questions on the mailing lists before filing an issue. See also http://spark.apache.org/contributing.html > Why the user cannot assign another `key.deserializer` and why it should > always be `ByteArrayDeserializer`? > > > Key: SPARK-33508 > URL: https://issues.apache.org/jira/browse/SPARK-33508 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Affects Versions: 3.0.1 >Reporter: Sayed Mohammad Hossein Torabi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33501) Encoding is not working if multiLine option is true.
[ https://issues.apache.org/jira/browse/SPARK-33501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33501: - Description: If we read with multiLine true and encoding "ISO-8859-1" then we are getting a value like {color:#ff}AUTO EL*�*TRICA{color}, and if we read with multiLine false and encoding "ISO-8859-1" then we are getting a value like {color:#ff}AUTO EL*É*TRICA{color} Below is the code we are using: {code} spark.read().option("header", "true").option("inferSchema", true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show() {code} A sample file is attached. was: If we read with multiLine true and encoding "ISO-8859-1" then we are getting a value like {color:#ff}AUTO EL*�*TRICA{color}, and if we read with multiLine false and encoding "ISO-8859-1" then we are getting a value like {color:#ff}AUTO EL*É*TRICA{color} Below is the code we are using: Dataset dataset1 = SparkUtil.getSparkSession().read(). option("header", "true"). option("inferSchema", true). option("delimiter", ";") .option("quote", "\"") .option("multiLine", true) .option("encoding", "ISO-8859-1") .csv("file path"); dataset1.show(); A sample file is attached. > Encoding is not working if multiLine option is true. > > > Key: SPARK-33501 > URL: https://issues.apache.org/jira/browse/SPARK-33501 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4 >Reporter: Nilesh Patil >Priority: Major > Attachments: 1605860036183.csv > > > If we read with multiLine true and encoding "ISO-8859-1" then we are > getting a value like {color:#ff}AUTO EL*�*TRICA{color}, and if we > read with multiLine false and encoding "ISO-8859-1" then we are getting > a value like {color:#ff}AUTO EL*É*TRICA{color} > Below is the code we are using: > {code} > spark.read().option("header", "true").option("inferSchema", > true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", > true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show() > {code} > A sample file is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33501) Encoding is not working if multiLine option is true.
[ https://issues.apache.org/jira/browse/SPARK-33501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237851#comment-17237851 ] Hyukjin Kwon commented on SPARK-33501: -- I can't reproduce this: {code} scala> spark.read.option("header", "true").option("inferSchema", true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show() +-+---+ | [DS_CANAL]|[DS_FORMULARIO]| +-+---+ |AUTO ELÉTRICA|Shop. de Preço 60Ah| |AUTO ELÉTRICA|Shop. de Preço 60Ah| |AUTO ELÉTRICA|Shop. de Preço 60Ah| |AUTO ELÉTRICA|Shop. de Preço 60Ah| |AUTO ELÉTRICA|Shop. de Preço 60Ah| +-+---+ {code} > Encoding is not working if multiLine option is true. > > > Key: SPARK-33501 > URL: https://issues.apache.org/jira/browse/SPARK-33501 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4 >Reporter: Nilesh Patil >Priority: Major > Attachments: 1605860036183.csv > > > If we read with multiLine true and encoding "ISO-8859-1" then we are > getting a value like {color:#ff}AUTO EL*�*TRICA{color}, and if we > read with multiLine false and encoding "ISO-8859-1" then we are getting > a value like {color:#ff}AUTO EL*É*TRICA{color} > Below is the code we are using: > {code} > spark.read().option("header", "true").option("inferSchema", > true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", > true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show() > {code} > A sample file is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33501) Encoding is not working if multiLine option is true.
[ https://issues.apache.org/jira/browse/SPARK-33501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33501. -- Resolution: Cannot Reproduce > Encoding is not working if multiLine option is true. > > > Key: SPARK-33501 > URL: https://issues.apache.org/jira/browse/SPARK-33501 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4 >Reporter: Nilesh Patil >Priority: Major > Attachments: 1605860036183.csv > > > If we read with multiLine true and encoding "ISO-8859-1" then we are > getting a value like {color:#ff}AUTO EL*�*TRICA{color}, and if we > read with multiLine false and encoding "ISO-8859-1" then we are getting > a value like {color:#ff}AUTO EL*É*TRICA{color} > Below is the code we are using: > {code} > spark.read().option("header", "true").option("inferSchema", > true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", > true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show() > {code} > A sample file is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33489) Support null for conversion from and to Arrow type
[ https://issues.apache.org/jira/browse/SPARK-33489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237853#comment-17237853 ] Hyukjin Kwon commented on SPARK-33489: -- cc [~bryanc] FYI. Does Arrow support null type? > Support null for conversion from and to Arrow type > -- > > Key: SPARK-33489 > URL: https://issues.apache.org/jira/browse/SPARK-33489 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.1 >Reporter: Yuya Kanai >Priority: Minor > > I got below error when using from_arrow_type() in pyspark.sql.pandas.types > {{Unsupported type in conversion from Arrow: null}} > I noticed NullType exists under pyspark.sql.types so it seems possible to > convert from pyarrow null to pyspark null type and vice versa. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33488) Re SPARK-21820. Creating Spark dataframe with carriage return/line feed leaves cr in multiline
[ https://issues.apache.org/jira/browse/SPARK-33488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33488. -- Resolution: Cannot Reproduce > Re SPARK-21820. Creating Spark dataframe with carriage return/line feed > leaves cr in multiline > --- > > Key: SPARK-33488 > URL: https://issues.apache.org/jira/browse/SPARK-33488 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5 > Environment: Apache 2.4.5 > Databricks 6.6 > Spark-NLP 2.6.3 >Reporter: Greg Werner >Priority: Major > > In SPARK-21820 I see what seems to be the same issue reported, but marked as > resolved there. Over the past few days I have battled a dataset that > occasionally has \r\n at the end of lines and I claim I do see this errant > behavior of not removing \r\n. > In my code, I do > {code:java} > // code placeholder# CSV options > infer_schema = "false" > first_row_is_header = "true" > multi_line = "true" > delimiter = "," > # The applied options are for CSV files. For other file types, these will be > ignored. > df_train = spark.read.format(train_file_type) \ > .option("inferSchema", infer_schema) \ > .option("header", first_row_is_header) \ > .option("sep", delimiter) \ > .option("multiLine", multi_line) \ > .option("escape", '"') \ > .load(train_file_location) > {code} > So I am reading in a csv file and setting multiLine to true. However, all > cases where there are \r\n in the training_file, \r is left behind. This > includes the header which has a column ending in \r. The only way I have > been able to workaround this is to manually edit the data file to remove the > \r, but I do not want to do this on a case to case basis. > Therefore, I am claiming this behavior is still present in 2.4.5 and is a bug. > I am using version 2.4.5 because I am using Spark-NLP which to my knowledge > has not been built to use 3 yet, so the version is key for me. 
> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33488) Re SPARK-21820. Creating Spark dataframe with carriage return/line feed leaves cr in multiline
[ https://issues.apache.org/jira/browse/SPARK-33488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237855#comment-17237855 ] Hyukjin Kwon commented on SPARK-33488: -- In Spark, you can now set the lineSep option. > Re SPARK-21820. Creating Spark dataframe with carriage return/line feed > leaves cr in multiline > --- > > Key: SPARK-33488 > URL: https://issues.apache.org/jira/browse/SPARK-33488 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5 > Environment: Apache 2.4.5 > Databricks 6.6 > Spark-NLP 2.6.3 >Reporter: Greg Werner >Priority: Major > > In SPARK-21820 I see what seems to be the same issue reported, but marked as > resolved there. Over the past few days I have battled a dataset that > occasionally has \r\n at the end of lines and I claim I do see this errant > behavior of not removing \r\n. > In my code, I do > {code:java} > // code placeholder# CSV options > infer_schema = "false" > first_row_is_header = "true" > multi_line = "true" > delimiter = "," > # The applied options are for CSV files. For other file types, these will be > ignored. > df_train = spark.read.format(train_file_type) \ > .option("inferSchema", infer_schema) \ > .option("header", first_row_is_header) \ > .option("sep", delimiter) \ > .option("multiLine", multi_line) \ > .option("escape", '"') \ > .load(train_file_location) > {code} > So I am reading in a csv file and setting multiLine to true. However, all > cases where there are \r\n in the training_file, \r is left behind. This > includes the header which has a column ending in \r. The only way I have > been able to workaround this is to manually edit the data file to remove the > \r, but I do not want to do this on a case to case basis. > Therefore, I am claiming this behavior is still present in 2.4.5 and is a bug. > I am using version 2.4.5 because I am using Spark-NLP which to my knowledge > has not been built to use 3 yet, so the version is key for me. 
> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
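The lineSep suggestion in the comment above can be illustrated outside Spark: if a reader treats "\n" alone as the record separator, Windows-style "\r\n" endings leave a stray "\r" on every record (including the header column the reporter describes), which an explicit separator avoids. A plain-Python sketch with made-up data, not the reporter's file:

```python
# Why CRLF data leaves a stray "\r" when the reader assumes "\n" records,
# and why naming the separator explicitly (as .option("lineSep", "\r\n")
# does in Spark) fixes it. The CSV content here is made up.
raw = "col_a,col_b\r\n1,x\r\n2,y\r\n"

# Split on "\n" only: every record, including the header, keeps a "\r".
naive = [line for line in raw.split("\n") if line]
print(repr(naive[0]))     # 'col_a,col_b\r'

# Split on the explicit two-character separator: records come out clean.
explicit = [line for line in raw.split("\r\n") if line]
print(repr(explicit[0]))  # 'col_a,col_b'
```

The same mismatch explains a header whose last column name ends in "\r".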
[jira] [Updated] (SPARK-33485) running spark application in kubernetes, but the application log shows yarn authentications
[ https://issues.apache.org/jira/browse/SPARK-33485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33485: - Target Version/s: (was: 3.0.0) > running spark application in kubernetes, but the application log shows yarn > authentications > > > Key: SPARK-33485 > URL: https://issues.apache.org/jira/browse/SPARK-33485 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Yuan Jiao >Priority: Major > Attachments: application.log, project.rar > > > My spark application accessing kerberized HDFS is running in kubernetes > cluster, but the application log shows: "Setting > spark.hadoop.yarn.resourcemanager.principal to tester (which is one of my > kerberos principals, yet I use the other principal joan to read HDFS files)": > ... > + CMD=("$SPARK_HOME/bin/spark-submit" --conf > "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client > "$@") > + exec /usr/bin/tini -s – /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress=10.244.1.61 --deploy-mode client --properties-file > /opt/spark/conf/spark.properties --class WordCount > local:///opt/spark/jars/WordCount-1.0-SNAPSHOT.jar > *Setting spark.hadoop.yarn.resourcemanager.principal to tester* > ... 
> 20/11/19 04:31:28 INFO HadoopFSDelegationTokenProvider: getting token for: > DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1041285450_1, > ugi=*tester@JOANTEST* (auth:KERBEROS)]] with renewer tester > 20/11/19 04:31:37 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 60 for > tester on ha-hdfs:nameservice1 > 20/11/19 04:31:37 INFO HadoopFSDelegationTokenProvider: getting token for: > DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1041285450_1, > ugi=*tester@JOANTEST* (auth:KERBEROS)]] with renewer tester@JOANTEST > 20/11/19 04:31:37 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 61 for > *tester* on ha-hdfs:nameservice1 > 20/11/19 04:31:37 INFO HadoopFSDelegationTokenProvider: Renewal interval is > 86400073 for token HDFS_DELEGATION_TOKEN > ... > 20/11/19 04:31:51 INFO UserGroupInformation: *Login successful for user joan > using keytab file /opt/hadoop/conf/joan.keytab* > ... > > I don't know why yarn authentication is needed here? And why is the principal > tester used for authorization? Can anyone help? Thanks! > The log and my spark project are attached below for reference. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33523) Add predicate related benchmark to SubExprEliminationBenchmark
[ https://issues.apache.org/jira/browse/SPARK-33523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33523. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30476 [https://github.com/apache/spark/pull/30476] > Add predicate related benchmark to SubExprEliminationBenchmark > -- > > Key: SPARK-33523 > URL: https://issues.apache.org/jira/browse/SPARK-33523 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.1.0 > > > This is for the task to add predicate related benchmark to > SubExprEliminationBenchmark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
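Subexpression elimination, the optimization this benchmark exercises, is easy to illustrate outside Spark: when the same subexpression appears several times in a projection or predicate, evaluating it once and reusing the result saves the repeated work. A minimal Python sketch of the idea (not Spark's implementation):

```python
# Common-subexpression elimination in miniature: a repeated subexpression
# is evaluated once and its result reused. This is the effect the
# benchmark measures; it is not Spark's implementation.
calls = {"n": 0}

def expensive(x):
    calls["n"] += 1          # count evaluations
    return x * x + 1

def naive(x):
    # expensive(x) appears -- and is evaluated -- twice.
    return expensive(x) + expensive(x)

def eliminated(x):
    common = expensive(x)    # evaluate once, reuse below
    return common + common

calls["n"] = 0
r1 = naive(3)
naive_calls = calls["n"]

calls["n"] = 0
r2 = eliminated(3)
eliminated_calls = calls["n"]

print(naive_calls, eliminated_calls, r1 == r2)  # 2 1 True
```

A predicate-focused benchmark measures the same trade-off when the shared expression sits inside filter conditions rather than a projection.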
[jira] [Updated] (SPARK-33501) Encoding is not working if multiLine option is true.
[ https://issues.apache.org/jira/browse/SPARK-33501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nilesh Patil updated SPARK-33501: - Attachment: Screenshot from 2020-11-24 10-27-17.png > Encoding is not working if multiLine option is true. > > > Key: SPARK-33501 > URL: https://issues.apache.org/jira/browse/SPARK-33501 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4 >Reporter: Nilesh Patil >Priority: Major > Attachments: 1605860036183.csv, Screenshot from 2020-11-24 > 10-27-17.png > > > If we read with multiLine true and encoding "ISO-8859-1" then we are > getting a value like {color:#ff}AUTO EL*�*TRICA{color}, and if we > read with multiLine false and encoding "ISO-8859-1" then we are getting > a value like {color:#ff}AUTO EL*É*TRICA{color} > Below is the code we are using: > {code} > spark.read().option("header", "true").option("inferSchema", > true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", > true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show() > {code} > A sample file is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33501) Encoding is not working if multiLine option is true.
[ https://issues.apache.org/jira/browse/SPARK-33501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237869#comment-17237869 ] Nilesh Patil commented on SPARK-33501: -- [~hyukjin.kwon] Please refer to the attached screenshot, with multiLine true and false. I am able to reproduce the same with the Java API as well. !Screenshot from 2020-11-24 10-27-17.png! > Encoding is not working if multiLine option is true. > > > Key: SPARK-33501 > URL: https://issues.apache.org/jira/browse/SPARK-33501 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4 >Reporter: Nilesh Patil >Priority: Major > Attachments: 1605860036183.csv, Screenshot from 2020-11-24 > 10-27-17.png > > > If we read with multiLine true and encoding "ISO-8859-1" then we are > getting a value like {color:#ff}AUTO EL*�*TRICA{color}, and if we > read with multiLine false and encoding "ISO-8859-1" then we are getting > a value like {color:#ff}AUTO EL*É*TRICA{color} > Below is the code we are using: > {code} > spark.read().option("header", "true").option("inferSchema", > true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", > true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show() > {code} > A sample file is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33501) Encoding is not working if multiLine option is true.
[ https://issues.apache.org/jira/browse/SPARK-33501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nilesh Patil updated SPARK-33501: - Affects Version/s: 2.4.3 > Encoding is not working if multiLine option is true. > > > Key: SPARK-33501 > URL: https://issues.apache.org/jira/browse/SPARK-33501 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 2.4.3 >Reporter: Nilesh Patil >Priority: Major > Attachments: 1605860036183.csv, Screenshot from 2020-11-24 > 10-27-17.png > > > If we read with multiLine true and encoding "ISO-8859-1" then we are > getting a value like {color:#ff}AUTO EL*�*TRICA{color}, and if we > read with multiLine false and encoding "ISO-8859-1" then we are getting > a value like {color:#ff}AUTO EL*É*TRICA{color} > Below is the code we are using: > {code} > spark.read().option("header", "true").option("inferSchema", > true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", > true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show() > {code} > A sample file is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33501) Encoding is not working if multiLine option is true.
[ https://issues.apache.org/jira/browse/SPARK-33501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237877#comment-17237877 ] Hyukjin Kwon commented on SPARK-33501: -- Can you try with Spark 3.0? > Encoding is not working if multiLine option is true. > > > Key: SPARK-33501 > URL: https://issues.apache.org/jira/browse/SPARK-33501 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 2.4.3 >Reporter: Nilesh Patil >Priority: Major > Attachments: 1605860036183.csv, Screenshot from 2020-11-24 > 10-27-17.png > > > If we read with multiLine true and encoding "ISO-8859-1" then we are > getting a value like {color:#ff}AUTO EL*�*TRICA{color}, and if we > read with multiLine false and encoding "ISO-8859-1" then we are getting > a value like {color:#ff}AUTO EL*É*TRICA{color} > Below is the code we are using: > {code} > spark.read().option("header", "true").option("inferSchema", > true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", > true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show() > {code} > A sample file is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33501) Encoding is not working if multiLine option is true.
[ https://issues.apache.org/jira/browse/SPARK-33501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237882#comment-17237882 ] Nilesh Patil commented on SPARK-33501: -- Yes, in 3.0 it's working. Is it possible to get the fix in 2.3 or 2.4? > Encoding is not working if multiLine option is true. > > > Key: SPARK-33501 > URL: https://issues.apache.org/jira/browse/SPARK-33501 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 2.4.3 >Reporter: Nilesh Patil >Priority: Major > Attachments: 1605860036183.csv, Screenshot from 2020-11-24 > 10-27-17.png > > > If we read with multiLine true and encoding "ISO-8859-1", then we are > getting a value like {color:#ff}AUTO EL*�*TRICA{color}, and if we > read with multiLine false and encoding "ISO-8859-1", then we are getting > a value like {color:#ff}AUTO EL*É*TRICA{color} > Below is the code we are using > {code} > spark.read().option("header", "true").option("inferSchema", > true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", > true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show() > {code} > A sample file is attached.
[jira] [Commented] (SPARK-33501) Encoding is not working if multiLine option is true.
[ https://issues.apache.org/jira/browse/SPARK-33501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237893#comment-17237893 ] Hyukjin Kwon commented on SPARK-33501: -- 2.3 is EOL. For 2.4, we could maybe think about it. You can identify the JIRA that fixed this issue and ask me or other people to assess it. > Encoding is not working if multiLine option is true. > > > Key: SPARK-33501 > URL: https://issues.apache.org/jira/browse/SPARK-33501 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 2.4.3 >Reporter: Nilesh Patil >Priority: Major > Attachments: 1605860036183.csv, Screenshot from 2020-11-24 > 10-27-17.png > > > If we read with multiLine true and encoding "ISO-8859-1", then we are > getting a value like {color:#ff}AUTO EL*�*TRICA{color}, and if we > read with multiLine false and encoding "ISO-8859-1", then we are getting > a value like {color:#ff}AUTO EL*É*TRICA{color} > Below is the code we are using > {code} > spark.read().option("header", "true").option("inferSchema", > true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", > true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show() > {code} > A sample file is attached.
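The mojibake in the report is consistent with the multiLine code path ignoring the `encoding` option and decoding the ISO-8859-1 bytes as UTF-8. This is only a hypothesis about the cause, but the symptom can be reproduced without Spark at all, using plain Python codecs:

```python
# "É" in ISO-8859-1 is the single byte 0xC9.
raw = "AUTO ELÉTRICA".encode("iso-8859-1")

# Decoding with the declared charset round-trips cleanly
# (matching the multiLine=false behavior in the report).
good = raw.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 fails on 0xC9 (a lead byte with no
# valid continuation) and yields the U+FFFD replacement character
# (matching the multiLine=true behavior in the report).
bad = raw.decode("utf-8", errors="replace")

print(good)  # AUTO ELÉTRICA
print(bad)   # AUTO EL�TRICA
```

If this is the mechanism, the workaround of transcoding the file to UTF-8 before reading it with `multiLine=true` would also follow naturally.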
[jira] [Updated] (SPARK-32221) Avoid possible errors due to incorrect file size or type supplied in spark conf.
[ https://issues.apache.org/jira/browse/SPARK-32221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-32221: Description: This would avoid failures in case the files are a bit large or a user places a binary file inside the SPARK_CONF_DIR, neither of which is supported at the moment. The reason is that the underlying etcd store limits the size of each entry to 1.5 MiB. [https://etcd.io/docs/v3.4.0/dev-guide/limit/] We can apply a straightforward approach of skipping files that cannot be accommodated within the 1.5 MiB limit (the limit is configurable, per the link above) and warning the user about it. For most use cases this limit is more than sufficient; however, a user may accidentally place a larger file and observe an unpredictable result or failures at run time. was: This would avoid failures in case the files are a bit large or a user places a binary file inside the SPARK_CONF_DIR, neither of which is supported at the moment. The reason is that the underlying etcd store limits the size of each entry to 1 MiB (recent versions of K8s have moved to etcd 3.4.x, which allows a 1.5 MiB limit). Once etcd is upgraded in all the popular k8s clusters, we can hope to overcome this limitation; e.g. the [https://etcd.io/docs/v3.4.0/dev-guide/limit/] version of etcd allows a higher limit on each entry. Even if that does not happen, there are other ways to overcome this limitation; for example, we can split the config files across multiple ConfigMaps. We need to discuss and prioritise that. This issue takes the straightforward approach of skipping files that cannot be accommodated within the 1.5 MiB limit and warning the user about it. > Avoid possible errors due to incorrect file size or type supplied in spark > conf. 
> > > Key: SPARK-32221 > URL: https://issues.apache.org/jira/browse/SPARK-32221 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Prashant Sharma >Priority: Major > > This would avoid failures in case the files are a bit large or a user places > a binary file inside the SPARK_CONF_DIR, > neither of which is supported at the moment. > The reason is that the underlying etcd store limits the size of each entry to > 1.5 MiB. > [https://etcd.io/docs/v3.4.0/dev-guide/limit/] > We can apply a straightforward approach of skipping files that cannot be > accommodated within the 1.5 MiB limit (the limit is configurable, per the link above) > and warning the user about it. > For most use cases this limit is more than sufficient; however, a user may > accidentally place a larger file and observe an unpredictable result or > failures at run time.
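The skip-and-warn approach described above is easy to sketch outside Spark. The helper below is illustrative only (the name `check_conf_dir` and the hard-coded 1.5 MiB default are assumptions, not Spark's actual implementation): it partitions the files of a conf directory into those that fit in a single etcd entry and those that should be skipped with a warning.

```python
import os
import warnings

# etcd's default per-entry limit; configurable on the etcd side,
# see https://etcd.io/docs/v3.4.0/dev-guide/limit/
ETCD_ENTRY_LIMIT = int(1.5 * 1024 * 1024)

def check_conf_dir(conf_dir, limit=ETCD_ENTRY_LIMIT):
    """Return (kept, skipped) file names, warning for each skipped file."""
    kept, skipped = [], []
    for name in sorted(os.listdir(conf_dir)):
        path = os.path.join(conf_dir, name)
        if not os.path.isfile(path):
            continue
        if os.path.getsize(path) > limit:
            warnings.warn(
                f"Skipping {name}: exceeds the {limit}-byte etcd entry limit"
            )
            skipped.append(name)
        else:
            kept.append(name)
    return kept, skipped
```

A real implementation would also need to reject binary files and account for the total ConfigMap size, not just each file individually.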
[jira] [Created] (SPARK-33526) Add config to control if cancel invoke interrupt task on thriftserver
ulysses you created SPARK-33526: --- Summary: Add config to control if cancel invoke interrupt task on thriftserver Key: SPARK-33526 URL: https://issues.apache.org/jira/browse/SPARK-33526 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: ulysses you After [#29933|https://github.com/apache/spark/pull/29933], we support cancelling a query on timeout, but the default behavior of `SparkContext.cancelJobGroup` won't interrupt tasks; it just lets them finish by themselves. In some cases this is dangerous, e.g., with data skew or a heavy shuffle: a task can keep running for a long time after the cancel, and its resources will not be released.
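The distinction the issue draws matters because `SparkContext.setJobGroup(groupId, description, interruptOnCancel)` controls whether a later `cancelJobGroup(groupId)` sends a thread interrupt to running tasks. As a rough Spark-free analogy (plain Python threads, not Spark's implementation), a cancellation flag only helps if the task actually checks it; a task that never checks keeps running to its own end:

```python
import threading
import time

cancel = threading.Event()

def cooperative():
    # Checks the cancel flag each iteration, like a task that honors
    # cancellation promptly.
    while not cancel.is_set():
        time.sleep(0.01)

def oblivious(duration):
    # Never looks at the flag, like a task that only stops when its
    # own work is done -- holding its resources the whole time.
    deadline = time.time() + duration
    while time.time() < deadline:
        time.sleep(0.01)

t1 = threading.Thread(target=cooperative)
t2 = threading.Thread(target=oblivious, args=(0.3,))
t1.start(); t2.start()

time.sleep(0.05)
cancel.set()          # "cancel the job group"
t1.join(timeout=2)    # the cooperative task exits promptly
t2.join()             # the oblivious one runs out its full duration
```

In the JVM, `interruptOnCancel=True` is the heavier hammer for the second kind of task, with the usual caveats about interrupting code that holds locks or native resources.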
[jira] [Created] (SPARK-33527) Extend the function of decode
jiaan.geng created SPARK-33527: -- Summary: Extend the function of decode Key: SPARK-33527 URL: https://issues.apache.org/jira/browse/SPARK-33527 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: jiaan.geng
[jira] [Updated] (SPARK-33527) Extend the function of decode
[ https://issues.apache.org/jira/browse/SPARK-33527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-33527: --- Description: In Spark, decode(bin, charset) - Decodes the first argument using the second argument character set. Unfortunately this is NOT what any other SQL vendor understands DECODE to do. DECODE generally is a shorthand for a simple case expression: {code:java} SELECT DECODE(c1, 1, 'Hello', 2, 'World', '!') FROM (VALUES (1), (2), (3)) AS T(c1) => (Hello), (World), (!) {code} > Extend the function of decode > - > > Key: SPARK-33527 > URL: https://issues.apache.org/jira/browse/SPARK-33527 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > In Spark, decode(bin, charset) - Decodes the first argument using the second > argument character set. > Unfortunately this is NOT what any other SQL vendor understands DECODE to do. > DECODE generally is a shorthand for a simple case expression: > {code:java} > SELECT DECODE(c1, 1, 'Hello', 2, 'World', '!') FROM (VALUES (1), (2), (3)) AS > T(c1) > => > (Hello), > (World), > (!) > {code}
[jira] [Updated] (SPARK-33527) Extend the function of decode so as consistent with mainstream databases
[ https://issues.apache.org/jira/browse/SPARK-33527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-33527: --- Summary: Extend the function of decode so as consistent with mainstream databases (was: Extend the function of decode) > Extend the function of decode so as consistent with mainstream databases > > > Key: SPARK-33527 > URL: https://issues.apache.org/jira/browse/SPARK-33527 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > In Spark, decode(bin, charset) - Decodes the first argument using the second > argument character set. > Unfortunately this is NOT what any other SQL vendor understands DECODE to do. > DECODE generally is a shorthand for a simple case expression: > {code:java} > SELECT DECODE(c1, 1, 'Hello', 2, 'World', '!') FROM (VALUES (1), (2), (3)) AS > T(c1) > => > (Hello), > (World), > (!) > {code}
[jira] [Assigned] (SPARK-33527) Extend the function of decode so as consistent with mainstream databases
[ https://issues.apache.org/jira/browse/SPARK-33527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33527: Assignee: (was: Apache Spark) > Extend the function of decode so as consistent with mainstream databases > > > Key: SPARK-33527 > URL: https://issues.apache.org/jira/browse/SPARK-33527 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > In Spark, decode(bin, charset) - Decodes the first argument using the second > argument character set. > Unfortunately this is NOT what any other SQL vendor understands DECODE to do. > DECODE generally is a shorthand for a simple case expression: > {code:java} > SELECT DECODE(c1, 1, 'Hello', 2, 'World', '!') FROM (VALUES (1), (2), (3)) AS > T(c1) > => > (Hello), > (World), > (!) > {code}
[jira] [Assigned] (SPARK-33527) Extend the function of decode so as consistent with mainstream databases
[ https://issues.apache.org/jira/browse/SPARK-33527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33527: Assignee: Apache Spark > Extend the function of decode so as consistent with mainstream databases > > > Key: SPARK-33527 > URL: https://issues.apache.org/jira/browse/SPARK-33527 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > In Spark, decode(bin, charset) - Decodes the first argument using the second > argument character set. > Unfortunately this is NOT what any other SQL vendor understands DECODE to do. > DECODE generally is a shorthand for a simple case expression: > {code:java} > SELECT DECODE(c1, 1, 'Hello', 2, 'World', '!') FROM (VALUES (1), (2), (3)) AS > T(c1) > => > (Hello), > (World), > (!) > {code}
[jira] [Commented] (SPARK-33527) Extend the function of decode so as consistent with mainstream databases
[ https://issues.apache.org/jira/browse/SPARK-33527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237932#comment-17237932 ] Apache Spark commented on SPARK-33527: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/30479 > Extend the function of decode so as consistent with mainstream databases > > > Key: SPARK-33527 > URL: https://issues.apache.org/jira/browse/SPARK-33527 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > In Spark, decode(bin, charset) - Decodes the first argument using the second > argument character set. > Unfortunately this is NOT what any other SQL vendor understands DECODE to do. > DECODE generally is a shorthand for a simple case expression: > {code:java} > SELECT DECODE(c1, 1, 'Hello', 2, 'World', '!') FROM (VALUES (1), (2), (3)) AS > T(c1) > => > (Hello), > (World), > (!) > {code}
[jira] [Commented] (SPARK-33527) Extend the function of decode so as consistent with mainstream databases
[ https://issues.apache.org/jira/browse/SPARK-33527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237933#comment-17237933 ] Apache Spark commented on SPARK-33527: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/30479 > Extend the function of decode so as consistent with mainstream databases > > > Key: SPARK-33527 > URL: https://issues.apache.org/jira/browse/SPARK-33527 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > In Spark, decode(bin, charset) - Decodes the first argument using the second > argument character set. > Unfortunately this is NOT what any other SQL vendor understands DECODE to do. > DECODE generally is a shorthand for a simple case expression: > {code:java} > SELECT DECODE(c1, 1, 'Hello', 2, 'World', '!') FROM (VALUES (1), (2), (3)) AS > T(c1) > => > (Hello), > (World), > (!) > {code}
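For reference, the Oracle-style DECODE in the JIRA example is equivalent to a simple CASE expression: `DECODE(c1, 1, 'Hello', 2, 'World', '!')` ≡ `CASE c1 WHEN 1 THEN 'Hello' WHEN 2 THEN 'World' ELSE '!' END`. A small Python model of those semantics (the name `sql_decode` is illustrative, not the proposed Spark API, and Oracle's special NULL-equals-NULL matching is deliberately left out):

```python
def sql_decode(expr, *args):
    """Oracle-style DECODE: compare expr against search values pairwise;
    a trailing unpaired argument is the default result, otherwise None."""
    args = list(args)
    default = args.pop() if len(args) % 2 == 1 else None
    for search, result in zip(args[0::2], args[1::2]):
        if expr == search:
            return result
    return default

# Mirrors the JIRA example over VALUES (1), (2), (3):
print([sql_decode(c1, 1, "Hello", 2, "World", "!") for c1 in (1, 2, 3)])
# ['Hello', 'World', '!']
```

This also makes the naming conflict concrete: Spark's existing `decode(bin, charset)` is a charset conversion, while the DECODE of mainstream databases is the lookup above.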