[jira] [Commented] (SPARK-33527) Extend the function of decode so as consistent with mainstream databases

2020-11-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237933#comment-17237933
 ] 

Apache Spark commented on SPARK-33527:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/30479

> Extend the function of decode so as consistent with mainstream databases
> 
>
> Key: SPARK-33527
> URL: https://issues.apache.org/jira/browse/SPARK-33527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> In Spark, decode(bin, charset) - Decodes the first argument using the second 
> argument character set.
> Unfortunately this is NOT what any other SQL vendor understands DECODE to do.
> DECODE is generally shorthand for a simple CASE expression:
> {code:java}
> SELECT DECODE(c1, 1, 'Hello', 2, 'World', '!') FROM (VALUES (1), (2), (3)) AS 
> T(c1)
> => 
> (Hello),
> (World)
> (!)
> {code}
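
For context, a minimal spark-shell sketch (the values and column names are
illustrative, not taken from the pull request) of the plain CASE expression that
such a DECODE call would be shorthand for:

{code:scala}
// Equivalent of DECODE(c1, 1, 'Hello', 2, 'World', '!') written as a simple
// CASE expression over the same inline table; runs as-is in a Spark 3.x shell.
val decoded = spark.sql("""
  SELECT c1,
         CASE c1 WHEN 1 THEN 'Hello' WHEN 2 THEN 'World' ELSE '!' END AS decoded
  FROM VALUES (1), (2), (3) AS T(c1)
""")
decoded.show()  // rows: (1, Hello), (2, World), (3, !)
{code}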






[jira] [Commented] (SPARK-33527) Extend the function of decode so as consistent with mainstream databases

2020-11-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237932#comment-17237932
 ] 

Apache Spark commented on SPARK-33527:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/30479

> Extend the function of decode so as consistent with mainstream databases
> 
>
> Key: SPARK-33527
> URL: https://issues.apache.org/jira/browse/SPARK-33527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> In Spark, decode(bin, charset) - Decodes the first argument using the second 
> argument character set.
> Unfortunately this is NOT what any other SQL vendor understands DECODE to do.
> DECODE is generally shorthand for a simple CASE expression:
> {code:java}
> SELECT DECODE(c1, 1, 'Hello', 2, 'World', '!') FROM (VALUES (1), (2), (3)) AS 
> T(c1)
> => 
> (Hello),
> (World)
> (!)
> {code}






[jira] [Assigned] (SPARK-33527) Extend the function of decode so as consistent with mainstream databases

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33527:


Assignee: Apache Spark

> Extend the function of decode so as consistent with mainstream databases
> 
>
> Key: SPARK-33527
> URL: https://issues.apache.org/jira/browse/SPARK-33527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> In Spark, decode(bin, charset) - Decodes the first argument using the second 
> argument character set.
> Unfortunately this is NOT what any other SQL vendor understands DECODE to do.
> DECODE is generally shorthand for a simple CASE expression:
> {code:java}
> SELECT DECODE(c1, 1, 'Hello', 2, 'World', '!') FROM (VALUES (1), (2), (3)) AS 
> T(c1)
> => 
> (Hello),
> (World)
> (!)
> {code}






[jira] [Assigned] (SPARK-33527) Extend the function of decode so as consistent with mainstream databases

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33527:


Assignee: (was: Apache Spark)

> Extend the function of decode so as consistent with mainstream databases
> 
>
> Key: SPARK-33527
> URL: https://issues.apache.org/jira/browse/SPARK-33527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> In Spark, decode(bin, charset) - Decodes the first argument using the second 
> argument character set.
> Unfortunately this is NOT what any other SQL vendor understands DECODE to do.
> DECODE is generally shorthand for a simple CASE expression:
> {code:java}
> SELECT DECODE(c1, 1, 'Hello', 2, 'World', '!') FROM (VALUES (1), (2), (3)) AS 
> T(c1)
> => 
> (Hello),
> (World)
> (!)
> {code}






[jira] [Updated] (SPARK-33527) Extend the function of decode so as consistent with mainstream databases

2020-11-23 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-33527:
---
Summary: Extend the function of decode so as consistent with mainstream 
databases  (was: Extend the function of decode)

> Extend the function of decode so as consistent with mainstream databases
> 
>
> Key: SPARK-33527
> URL: https://issues.apache.org/jira/browse/SPARK-33527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> In Spark, decode(bin, charset) - Decodes the first argument using the second 
> argument character set.
> Unfortunately this is NOT what any other SQL vendor understands DECODE to do.
> DECODE is generally shorthand for a simple CASE expression:
> {code:java}
> SELECT DECODE(c1, 1, 'Hello', 2, 'World', '!') FROM (VALUES (1), (2), (3)) AS 
> T(c1)
> => 
> (Hello),
> (World)
> (!)
> {code}






[jira] [Updated] (SPARK-33527) Extend the function of decode

2020-11-23 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-33527:
---
Description: 
In Spark, decode(bin, charset) - Decodes the first argument using the second 
argument character set.

Unfortunately this is NOT what any other SQL vendor understands DECODE to do.
DECODE is generally shorthand for a simple CASE expression:


{code:java}
SELECT DECODE(c1, 1, 'Hello', 2, 'World', '!') FROM (VALUES (1), (2), (3)) AS 
T(c1)
=> 
(Hello),
(World)
(!)
{code}


> Extend the function of decode
> -
>
> Key: SPARK-33527
> URL: https://issues.apache.org/jira/browse/SPARK-33527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> In Spark, decode(bin, charset) - Decodes the first argument using the second 
> argument character set.
> Unfortunately this is NOT what any other SQL vendor understands DECODE to do.
> DECODE is generally shorthand for a simple CASE expression:
> {code:java}
> SELECT DECODE(c1, 1, 'Hello', 2, 'World', '!') FROM (VALUES (1), (2), (3)) AS 
> T(c1)
> => 
> (Hello),
> (World)
> (!)
> {code}






[jira] [Created] (SPARK-33527) Extend the function of decode

2020-11-23 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-33527:
--

 Summary: Extend the function of decode
 Key: SPARK-33527
 URL: https://issues.apache.org/jira/browse/SPARK-33527
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: jiaan.geng









[jira] [Created] (SPARK-33526) Add config to control if cancel invoke interrupt task on thriftserver

2020-11-23 Thread ulysses you (Jira)
ulysses you created SPARK-33526:
---

 Summary: Add config to control if cancel invoke interrupt task on 
thriftserver
 Key: SPARK-33526
 URL: https://issues.apache.org/jira/browse/SPARK-33526
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: ulysses you


After [#29933|https://github.com/apache/spark/pull/29933], we support cancelling 
a query on timeout, but the default behavior of `SparkContext.cancelJobGroup` 
does not interrupt tasks; it just lets them finish on their own. In some cases 
this is dangerous, e.g., with data skew or a heavy shuffle: a task can hang 
around for a long time after the cancel, and its resources are not released.
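
For reference, a minimal sketch of the existing SparkContext API this improvement 
builds on (the group id, query and sleep below are illustrative): the 
interruptOnCancel flag of setJobGroup decides whether a later cancelJobGroup also 
interrupts the running task threads.

{code:scala}
import org.apache.spark.sql.SparkSession

object CancelGroupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("cancel-sketch").getOrCreate()
    val sc = spark.sparkContext

    val runner = new Thread(() => {
      // The job group must be set on the thread that submits the job.
      // interruptOnCancel = true opts in to interrupting task threads on cancel;
      // with the default (false) a cancelled task may keep running for a while.
      sc.setJobGroup("query-42", "long running statement", interruptOnCancel = true)
      try spark.range(5000000000L).selectExpr("sum(id)").collect()
      catch { case e: Exception => println(s"job stopped: ${e.getMessage}") }
    })
    runner.start()

    Thread.sleep(2000)
    sc.cancelJobGroup("query-42") // interrupts the tasks because of the flag above
    runner.join()
    spark.stop()
  }
}
{code}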






[jira] [Updated] (SPARK-32221) Avoid possible errors due to incorrect file size or type supplied in spark conf.

2020-11-23 Thread Prashant Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-32221:

Description: 
This would avoid failures, in case the files are a bit large or a user places a 
binary file inside the SPARK_CONF_DIR.

Both of which are not supported at the moment.

The reason is that the underlying etcd store limits the size of each entry to 
1.5 MiB.

[https://etcd.io/docs/v3.4.0/dev-guide/limit/] 

We can apply a straightforward approach of skipping files that cannot be 
accommodated within the 1.5 MiB limit (the limit is configurable, as per the 
link above) and warning the user about them.

For most use cases this limit is more than sufficient; however, a user may 
accidentally place a larger file and observe unpredictable results or failures 
at run time.


  was:
This would avoid failures, in case the files are a bit large or a user places a 
binary file inside the SPARK_CONF_DIR.

Both of which are not supported at the moment.

The reason is, underlying etcd store does limit the size of each entry to only 
1 MiB( Recent versions of K8s have moved to using 3.4.x of etcd which allows 
for 1.5MiB limit). Once etcd is upgraded in all the popular k8s clusters, then 
we can hope to overcome this limitation. e.g. 
[https://etcd.io/docs/v3.4.0/dev-guide/limit/] version of etcd allows for 
higher limit on each entry.

Even if that does not happen, there are other ways to overcome this limitation, 
for example, we can have config files split across multiple configMaps. We need 
to discuss, and prioritise, this issue takes the straightforward approach of 
skipping files that cannot be accommodated within 1.5MiB limit and WARNING the 
user about the same.


> Avoid possible errors due to incorrect file size or type supplied in spark 
> conf.
> 
>
> Key: SPARK-32221
> URL: https://issues.apache.org/jira/browse/SPARK-32221
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> This would avoid failures, in case the files are a bit large or a user places 
> a binary file inside the SPARK_CONF_DIR.
> Both of which are not supported at the moment.
> The reason is that the underlying etcd store limits the size of each entry to 
> 1.5 MiB.
> [https://etcd.io/docs/v3.4.0/dev-guide/limit/] 
> We can apply a straightforward approach of skipping files that cannot be 
> accommodated within the 1.5 MiB limit (the limit is configurable, as per the 
> link above) and warning the user about them.
> For most use cases this limit is more than sufficient; however, a user may 
> accidentally place a larger file and observe unpredictable results or failures 
> at run time.
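
As a rough illustration of the proposed behavior (the object name, the helper and 
the hard-coded 1.5 MiB constant below are assumptions for the sketch, not the 
actual Spark Kubernetes code), the size check could look like this:

{code:scala}
import java.io.File

object ConfDirSizeFilter {
  // etcd's default per-entry limit; configurable on the etcd side.
  val MaxEntryBytes: Long = (1.5 * 1024 * 1024).toLong

  // Keep only regular files that fit into a single ConfigMap entry and
  // warn about the ones that are skipped.
  def selectConfFiles(confDir: File): Seq[File] = {
    val files = Option(confDir.listFiles()).getOrElse(Array.empty[File]).toSeq.filter(_.isFile)
    val (kept, skipped) = files.partition(_.length() <= MaxEntryBytes)
    skipped.foreach { f =>
      Console.err.println(
        s"WARNING: skipping ${f.getName} (${f.length()} bytes), larger than $MaxEntryBytes bytes")
    }
    kept
  }

  def main(args: Array[String]): Unit =
    selectConfFiles(new File(sys.env.getOrElse("SPARK_CONF_DIR", "conf")))
      .foreach(f => println(s"would ship ${f.getName}"))
}
{code}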






[jira] [Commented] (SPARK-33501) Encoding is not working if multiLine option is true.

2020-11-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237893#comment-17237893
 ] 

Hyukjin Kwon commented on SPARK-33501:
--

2.3 is EOL. For 2.4, we could maybe think about it. You can identify the JIRA 
that fixed this issue and ask me or other people to assess it.

> Encoding is not working if multiLine option is true.
> 
>
> Key: SPARK-33501
> URL: https://issues.apache.org/jira/browse/SPARK-33501
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4, 2.4.3
>Reporter: Nilesh Patil
>Priority: Major
> Attachments: 1605860036183.csv, Screenshot from 2020-11-24 
> 10-27-17.png
>
>
> If we read with multiLine true and encoding "ISO-8859-1" then we get a value 
> like {color:#ff}AUTO EL*�*TRICA{color}, and if we read with multiLine false 
> and encoding "ISO-8859-1" then we get a value like 
> {color:#ff}AUTO EL*É*TRICA{color}.
> Below is the code we are using
> {code}
> spark.read().option("header", "true").option("inferSchema", 
> true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", 
> true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show()
> {code}
> A sample file is attached.






[jira] [Commented] (SPARK-33501) Encoding is not working if multiLine option is true.

2020-11-23 Thread Nilesh Patil (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237882#comment-17237882
 ] 

Nilesh Patil commented on SPARK-33501:
--

Yes, in 3.0 it's working. Is it possible to get the fix in the 2.3 or 2.4 versions?

> Encoding is not working if multiLine option is true.
> 
>
> Key: SPARK-33501
> URL: https://issues.apache.org/jira/browse/SPARK-33501
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4, 2.4.3
>Reporter: Nilesh Patil
>Priority: Major
> Attachments: 1605860036183.csv, Screenshot from 2020-11-24 
> 10-27-17.png
>
>
> If we read with multiLine true and encoding "ISO-8859-1" then we get a value 
> like {color:#ff}AUTO EL*�*TRICA{color}, and if we read with multiLine false 
> and encoding "ISO-8859-1" then we get a value like 
> {color:#ff}AUTO EL*É*TRICA{color}.
> Below is the code we are using
> {code}
> spark.read().option("header", "true").option("inferSchema", 
> true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", 
> true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show()
> {code}
> A sample file is attached.






[jira] [Commented] (SPARK-33501) Encoding is not working if multiLine option is true.

2020-11-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237877#comment-17237877
 ] 

Hyukjin Kwon commented on SPARK-33501:
--

Can you try with Spark 3.0?

> Encoding is not working if multiLine option is true.
> 
>
> Key: SPARK-33501
> URL: https://issues.apache.org/jira/browse/SPARK-33501
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4, 2.4.3
>Reporter: Nilesh Patil
>Priority: Major
> Attachments: 1605860036183.csv, Screenshot from 2020-11-24 
> 10-27-17.png
>
>
> If we read with multiLine true and encoding "ISO-8859-1" then we get a value 
> like {color:#ff}AUTO EL*�*TRICA{color}, and if we read with multiLine false 
> and encoding "ISO-8859-1" then we get a value like 
> {color:#ff}AUTO EL*É*TRICA{color}.
> Below is the code we are using
> {code}
> spark.read().option("header", "true").option("inferSchema", 
> true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", 
> true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show()
> {code}
> A sample file is attached.






[jira] [Updated] (SPARK-33501) Encoding is not working if multiLine option is true.

2020-11-23 Thread Nilesh Patil (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nilesh Patil updated SPARK-33501:
-
Affects Version/s: 2.4.3

> Encoding is not working if multiLine option is true.
> 
>
> Key: SPARK-33501
> URL: https://issues.apache.org/jira/browse/SPARK-33501
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4, 2.4.3
>Reporter: Nilesh Patil
>Priority: Major
> Attachments: 1605860036183.csv, Screenshot from 2020-11-24 
> 10-27-17.png
>
>
> If we read with multiLine true and encoding "ISO-8859-1" then we get a value 
> like {color:#ff}AUTO EL*�*TRICA{color}, and if we read with multiLine false 
> and encoding "ISO-8859-1" then we get a value like 
> {color:#ff}AUTO EL*É*TRICA{color}.
> Below is the code we are using
> {code}
> spark.read().option("header", "true").option("inferSchema", 
> true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", 
> true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show()
> {code}
> A sample file is attached.






[jira] [Commented] (SPARK-33501) Encoding is not working if multiLine option is true.

2020-11-23 Thread Nilesh Patil (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237869#comment-17237869
 ] 

Nilesh Patil commented on SPARK-33501:
--

[~hyukjin.kwon] Please refer to the attached screenshot, with multiLine true & 
false. I am able to reproduce the same with the Java API as well.

 

!Screenshot from 2020-11-24 10-27-17.png!

> Encoding is not working if multiLine option is true.
> 
>
> Key: SPARK-33501
> URL: https://issues.apache.org/jira/browse/SPARK-33501
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4
>Reporter: Nilesh Patil
>Priority: Major
> Attachments: 1605860036183.csv, Screenshot from 2020-11-24 
> 10-27-17.png
>
>
> If we read with multiLine true and encoding "ISO-8859-1" then we get a value 
> like {color:#ff}AUTO EL*�*TRICA{color}, and if we read with multiLine false 
> and encoding "ISO-8859-1" then we get a value like 
> {color:#ff}AUTO EL*É*TRICA{color}.
> Below is the code we are using
> {code}
> spark.read().option("header", "true").option("inferSchema", 
> true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", 
> true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show()
> {code}
> A sample file is attached.






[jira] [Updated] (SPARK-33501) Encoding is not working if multiLine option is true.

2020-11-23 Thread Nilesh Patil (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nilesh Patil updated SPARK-33501:
-
Attachment: Screenshot from 2020-11-24 10-27-17.png

> Encoding is not working if multiLine option is true.
> 
>
> Key: SPARK-33501
> URL: https://issues.apache.org/jira/browse/SPARK-33501
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4
>Reporter: Nilesh Patil
>Priority: Major
> Attachments: 1605860036183.csv, Screenshot from 2020-11-24 
> 10-27-17.png
>
>
> If we read with multiLine true and encoding "ISO-8859-1" then we get a value 
> like {color:#ff}AUTO EL*�*TRICA{color}, and if we read with multiLine false 
> and encoding "ISO-8859-1" then we get a value like 
> {color:#ff}AUTO EL*É*TRICA{color}.
> Below is the code we are using
> {code}
> spark.read().option("header", "true").option("inferSchema", 
> true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", 
> true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show()
> {code}
> A sample file is attached.






[jira] [Resolved] (SPARK-33523) Add predicate related benchmark to SubExprEliminationBenchmark

2020-11-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33523.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30476
[https://github.com/apache/spark/pull/30476]

> Add predicate related benchmark to SubExprEliminationBenchmark
> --
>
> Key: SPARK-33523
> URL: https://issues.apache.org/jira/browse/SPARK-33523
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>
> This is for the task to add predicate related benchmark to 
> SubExprEliminationBenchmark.






[jira] [Updated] (SPARK-33485) running spark application in kubernetes, but the application log shows yarn authentications

2020-11-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33485:
-
Target Version/s:   (was: 3.0.0)

> running spark application in kubernetes, but the application log shows yarn 
> authentications 
> 
>
> Key: SPARK-33485
> URL: https://issues.apache.org/jira/browse/SPARK-33485
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Yuan Jiao
>Priority: Major
> Attachments: application.log, project.rar
>
>
> My spark application accessing kerberized HDFS is running in a kubernetes 
> cluster, but the application log shows: "Setting 
> spark.hadoop.yarn.resourcemanager.principal to tester" (tester is one of my 
> kerberos principals, yet I use the other principal joan to read HDFS files):
> ... 
> + CMD=("$SPARK_HOME/bin/spark-submit" --conf 
> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client 
> "$@")
>  + exec /usr/bin/tini -s – /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress=10.244.1.61 --deploy-mode client --properties-file 
> /opt/spark/conf/spark.properties --class WordCount 
> local:///opt/spark/jars/WordCount-1.0-SNAPSHOT.jar
>  *Setting spark.hadoop.yarn.resourcemanager.principal to tester*
> ...
> 20/11/19 04:31:28 INFO HadoopFSDelegationTokenProvider: getting token for: 
> DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1041285450_1, 
> ugi=*tester@JOANTEST* (auth:KERBEROS)]] with renewer tester
>  20/11/19 04:31:37 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 60 for 
> tester on ha-hdfs:nameservice1
>  20/11/19 04:31:37 INFO HadoopFSDelegationTokenProvider: getting token for: 
> DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1041285450_1, 
> ugi=*tester@JOANTEST* (auth:KERBEROS)]] with renewer tester@JOANTEST
>  20/11/19 04:31:37 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 61 for 
> *tester* on ha-hdfs:nameservice1
>  20/11/19 04:31:37 INFO HadoopFSDelegationTokenProvider: Renewal interval is 
> 86400073 for token HDFS_DELEGATION_TOKEN
> ...
>  20/11/19 04:31:51 INFO UserGroupInformation: *Login successful for user joan 
> using keytab file /opt/hadoop/conf/joan.keytab*
> ...
>  
> I don't know why yarn authentication is needed here, and why the principal 
> tester is used for authorization. Can anyone help? Thanks!
> The log and my spark project are attached below for reference.
>  






[jira] [Commented] (SPARK-33488) Re SPARK-21820. Creating Spark dataframe with carriage return/line feed leaves cr in multiline

2020-11-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237855#comment-17237855
 ] 

Hyukjin Kwon commented on SPARK-33488:
--

In Spark, you can now set the lineSep option.

> Re SPARK-21820.  Creating Spark dataframe with carriage return/line feed 
> leaves cr in multiline
> ---
>
> Key: SPARK-33488
> URL: https://issues.apache.org/jira/browse/SPARK-33488
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: Apache 2.4.5
> Databricks 6.6
> Spark-NLP 2.6.3
>Reporter: Greg Werner
>Priority: Major
>
> In SPARK-21820 I see what seems to be the same issue reported, but marked as 
> resolved there.  Over the past few days I have battled a dataset that 
> occasionally has \r\n at the end of lines and I claim I do see this errant 
> behavior of not removing \r\n.
> In my code, I do 
> {code:java}
> # CSV options
> infer_schema = "false"
> first_row_is_header = "true"
> multi_line = "true"
> delimiter = ","
> # The applied options are for CSV files. For other file types, these will be 
> ignored.
> df_train = spark.read.format(train_file_type) \
>   .option("inferSchema", infer_schema) \
>   .option("header", first_row_is_header) \
>   .option("sep", delimiter) \
>   .option("multiLine", multi_line) \
>   .option("escape", '"') \
>   .load(train_file_location)
> {code}
> So I am reading in a csv file and setting multiLine to true.  However, in all 
> cases where there are \r\n in the training_file, \r is left behind.  This 
> includes the header, which has a column ending in \r.  The only way I have 
> been able to work around this is to manually edit the data file to remove the 
> \r, but I do not want to do this on a case-by-case basis.
> Therefore, I am claiming this behavior is still present in 2.4.5 and is a bug.
> I am using version 2.4.5 because I am using Spark-NLP which to my knowledge 
> has not been built to use 3 yet, so the version is key for me.
>  






[jira] [Resolved] (SPARK-33488) Re SPARK-21820. Creating Spark dataframe with carriage return/line feed leaves cr in multiline

2020-11-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33488.
--
Resolution: Cannot Reproduce

> Re SPARK-21820.  Creating Spark dataframe with carriage return/line feed 
> leaves cr in multiline
> ---
>
> Key: SPARK-33488
> URL: https://issues.apache.org/jira/browse/SPARK-33488
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: Apache 2.4.5
> Databricks 6.6
> Spark-NLP 2.6.3
>Reporter: Greg Werner
>Priority: Major
>
> In SPARK-21820 I see what seems to be the same issue reported, but marked as 
> resolved there.  Over the past few days I have battled a dataset that 
> occasionally has \r\n at the end of lines and I claim I do see this errant 
> behavior of not removing \r\n.
> In my code, I do 
> {code:java}
> # CSV options
> infer_schema = "false"
> first_row_is_header = "true"
> multi_line = "true"
> delimiter = ","
> # The applied options are for CSV files. For other file types, these will be 
> ignored.
> df_train = spark.read.format(train_file_type) \
>   .option("inferSchema", infer_schema) \
>   .option("header", first_row_is_header) \
>   .option("sep", delimiter) \
>   .option("multiLine", multi_line) \
>   .option("escape", '"') \
>   .load(train_file_location)
> {code}
> So I am reading in a csv file and setting multiLine to true.  However, in all 
> cases where there are \r\n in the training_file, \r is left behind.  This 
> includes the header, which has a column ending in \r.  The only way I have 
> been able to work around this is to manually edit the data file to remove the 
> \r, but I do not want to do this on a case-by-case basis.
> Therefore, I am claiming this behavior is still present in 2.4.5 and is a bug.
> I am using version 2.4.5 because I am using Spark-NLP which to my knowledge 
> has not been built to use 3 yet, so the version is key for me.
>  






[jira] [Commented] (SPARK-33489) Support null for conversion from and to Arrow type

2020-11-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237853#comment-17237853
 ] 

Hyukjin Kwon commented on SPARK-33489:
--

cc [~bryanc] FYI. Does Arrow support null type?

> Support null for conversion from and to Arrow type
> --
>
> Key: SPARK-33489
> URL: https://issues.apache.org/jira/browse/SPARK-33489
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: Yuya Kanai
>Priority: Minor
>
> I got below error when using from_arrow_type() in pyspark.sql.pandas.types
> {{Unsupported type in conversion from Arrow: null}}
> I noticed NullType exists under pyspark.sql.types so it seems possible to 
> convert from pyarrow null to pyspark null type and vice versa.






[jira] [Resolved] (SPARK-33501) Encoding is not working if multiLine option is true.

2020-11-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33501.
--
Resolution: Cannot Reproduce

> Encoding is not working if multiLine option is true.
> 
>
> Key: SPARK-33501
> URL: https://issues.apache.org/jira/browse/SPARK-33501
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4
>Reporter: Nilesh Patil
>Priority: Major
> Attachments: 1605860036183.csv
>
>
> If we read with multiLine true and encoding "ISO-8859-1" then we get a value 
> like {color:#ff}AUTO EL*�*TRICA{color}, and if we read with multiLine false 
> and encoding "ISO-8859-1" then we get a value like 
> {color:#ff}AUTO EL*É*TRICA{color}.
> Below is the code we are using
> {code}
> spark.read().option("header", "true").option("inferSchema", 
> true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", 
> true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show()
> {code}
> A sample file is attached.






[jira] [Commented] (SPARK-33501) Encoding is not working if multiLine option is true.

2020-11-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237851#comment-17237851
 ] 

Hyukjin Kwon commented on SPARK-33501:
--

I can't reproduce this:

{code}
scala> spark.read.option("header", "true").option("inferSchema", 
true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", 
true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show()
+-+---+
|   [DS_CANAL]|[DS_FORMULARIO]|
+-+---+
|AUTO ELÉTRICA|Shop. de Preço 60Ah|
|AUTO ELÉTRICA|Shop. de Preço 60Ah|
|AUTO ELÉTRICA|Shop. de Preço 60Ah|
|AUTO ELÉTRICA|Shop. de Preço 60Ah|
|AUTO ELÉTRICA|Shop. de Preço 60Ah|
+-+---+
{code}

> Encoding is not working if multiLine option is true.
> 
>
> Key: SPARK-33501
> URL: https://issues.apache.org/jira/browse/SPARK-33501
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4
>Reporter: Nilesh Patil
>Priority: Major
> Attachments: 1605860036183.csv
>
>
> If we read with multiLine true and encoding "ISO-8859-1" then we get a value 
> like {color:#ff}AUTO EL*�*TRICA{color}, and if we read with multiLine false 
> and encoding "ISO-8859-1" then we get a value like 
> {color:#ff}AUTO EL*É*TRICA{color}.
> Below is the code we are using
> {code}
> spark.read().option("header", "true").option("inferSchema", 
> true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", 
> true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show()
> {code}
> A sample file is attached.






[jira] [Updated] (SPARK-33501) Encoding is not working if multiLine option is true.

2020-11-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33501:
-
Description: 
If we read with multiLine true and encoding "ISO-8859-1" then we get a value 
like {color:#ff}AUTO EL*�*TRICA{color}, and if we read with multiLine false 
and encoding "ISO-8859-1" then we get a value like 
{color:#ff}AUTO EL*É*TRICA{color}.

Below is the code we are using

{code}
spark.read().option("header", "true").option("inferSchema", 
true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", 
true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show()
{code}

A sample file is attached.

  was:
If we read with mulitLine true and encoding with "ISO-8859-1" then we are 
getting  value like this {color:#ff}AUTO EL*�*TRICA{color}. and if we read 
with multiLine false  and encoding with "ISO-8859-1" thne we are getting  value 
like {color:#ff}AUTO EL*É*TRICA{color}

Below is the code we are using

Dataset dataset1 = SparkUtil.getSparkSession().read().Dataset 
dataset1 = SparkUtil.getSparkSession().read(). option("header", "true"). 
option("inferSchema", true). option("delimiter", ";") .option("quote", "\"") 
.option("multiLine", true) .option("encoding", "ISO-8859-1") .csv("file path");

dataset1.show();

Sample file is attached in attachement 


> Encoding is not working if multiLine option is true.
> 
>
> Key: SPARK-33501
> URL: https://issues.apache.org/jira/browse/SPARK-33501
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4
>Reporter: Nilesh Patil
>Priority: Major
> Attachments: 1605860036183.csv
>
>
> If we read with multiLine true and encoding "ISO-8859-1" then we get a value 
> like {color:#ff}AUTO EL*�*TRICA{color}, and if we read with multiLine false 
> and encoding "ISO-8859-1" then we get a value like 
> {color:#ff}AUTO EL*É*TRICA{color}.
> Below is the code we are using
> {code}
> spark.read().option("header", "true").option("inferSchema", 
> true).option("delimiter", ";") .option("quote", "\"") .option("multiLine", 
> true) .option("encoding", "ISO-8859-1").csv("1605860036183.csv").show()
> {code}
> A sample file is attached.






[jira] [Resolved] (SPARK-33508) Why the user cannot assign another `key.deserializer` and why it should always be `ByteArrayDeserializer`?

2020-11-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33508.
--
Resolution: Invalid

Let's ask questions on the mailing lists before filing an issue. See also 
http://spark.apache.org/contributing.html

> Why the user cannot assign another `key.deserializer` and why it should 
> always be `ByteArrayDeserializer`?
> 
>
> Key: SPARK-33508
> URL: https://issues.apache.org/jira/browse/SPARK-33508
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Affects Versions: 3.0.1
>Reporter: Sayed Mohammad Hossein Torabi
>Priority: Major
>







[jira] [Assigned] (SPARK-33524) Change `InMemoryTable` not to use Tuple.hashCode for `BucketTransform`

2020-11-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33524:
-

Assignee: Dongjoon Hyun

> Change `InMemoryTable` not to use Tuple.hashCode for `BucketTransform`
> --
>
> Key: SPARK-33524
> URL: https://issues.apache.org/jira/browse/SPARK-33524
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> `sql/core` module seems to be broken. I will file a new Jira.
> {code:java}
> $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13
> ...
> [info] - SPARK-31255: Project a metadata column *** FAILED *** (96 
> milliseconds)
> [info] - SPARK-31255: Projects data column when metadata column has the same 
> name *** FAILED *** (77 milliseconds){code}
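
As background only, a small illustrative sketch (not the actual InMemoryTable 
code; the names are made up) of deriving a bucket id from the row values 
themselves instead of calling hashCode on a Tuple of them:

{code:scala}
object BucketIdSketch {
  // Mix the element hash codes positionally rather than relying on Tuple.hashCode.
  def bucketId(values: Seq[Any], numBuckets: Int): Int = {
    val mixed = values.foldLeft(17)((acc, v) => 31 * acc + (if (v == null) 0 else v.hashCode()))
    Math.floorMod(mixed, numBuckets)
  }

  def main(args: Array[String]): Unit = {
    println(bucketId(Seq(42, "user-a"), numBuckets = 4))
    println(bucketId(Seq(42, "user-a"), numBuckets = 4)) // deterministic for equal inputs
  }
}
{code}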






[jira] [Resolved] (SPARK-33524) Change `InMemoryTable` not to use Tuple.hashCode for `BucketTransform`

2020-11-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33524.
---
Fix Version/s: 3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 30477
[https://github.com/apache/spark/pull/30477]

> Change `InMemoryTable` not to use Tuple.hashCode for `BucketTransform`
> --
>
> Key: SPARK-33524
> URL: https://issues.apache.org/jira/browse/SPARK-33524
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0, 3.0.2
>
>
> `sql/core` module seems to be broken. I will file a new Jira.
> {code:java}
> $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13
> ...
> [info] - SPARK-31255: Project a metadata column *** FAILED *** (96 
> milliseconds)
> [info] - SPARK-31255: Projects data column when metadata column has the same 
> name *** FAILED *** (77 milliseconds){code}






[jira] [Commented] (SPARK-33525) Upgrade hive-service-rpc to 3.1.2

2020-11-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237823#comment-17237823
 ] 

Apache Spark commented on SPARK-33525:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/30478

> Upgrade hive-service-rpc to 3.1.2
> -
>
> Key: SPARK-33525
> URL: https://issues.apache.org/jira/browse/SPARK-33525
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> We support Hive metastore versions 0.12.0 through 3.1.2, but we only support 
> hive-jdbc versions 0.12.0 through 2.3.7. It throws TProtocolException if we 
> use hive-jdbc 3.x:
> {noformat}
> [root@spark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u 
> jdbc:hive2://localhost:1/default
> Connecting to jdbc:hive2://localhost:1/default
> Connected to: Spark SQL (version 3.1.0-SNAPSHOT)
> Driver: Hive JDBC (version 3.1.2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 3.1.2 by Apache Hive
> 0: jdbc:hive2://localhost:1/default> create table t1(id int) using 
> parquet;
> Unexpected end of file when reading from HS2 server. The root cause might be 
> too many concurrent connections. Please ask the administrator to check the 
> number of active connections, and adjust 
> hive.server2.thrift.max.worker.threads if applicable.
> Error: org.apache.thrift.transport.TTransportException (state=08S01,code=0)
> {noformat}
> {noformat}
> org.apache.thrift.protocol.TProtocolException: Missing version in 
> readMessageBegin, old client?
>   at 
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:234)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
>   at java.base/java.lang.Thread.run(Thread.java:832)
> {noformat}
> We can upgrade hive-service-rpc to 3.1.2 to fix this issue.






[jira] [Assigned] (SPARK-33525) Upgrade hive-service-rpc to 3.1.2

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33525:


Assignee: Apache Spark

> Upgrade hive-service-rpc to 3.1.2
> -
>
> Key: SPARK-33525
> URL: https://issues.apache.org/jira/browse/SPARK-33525
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> We support Hive metastore versions 0.12.0 through 3.1.2, but we only support 
> hive-jdbc versions 0.12.0 through 2.3.7. It throws TProtocolException if we 
> use hive-jdbc 3.x:
> {noformat}
> [root@spark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u 
> jdbc:hive2://localhost:1/default
> Connecting to jdbc:hive2://localhost:1/default
> Connected to: Spark SQL (version 3.1.0-SNAPSHOT)
> Driver: Hive JDBC (version 3.1.2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 3.1.2 by Apache Hive
> 0: jdbc:hive2://localhost:1/default> create table t1(id int) using 
> parquet;
> Unexpected end of file when reading from HS2 server. The root cause might be 
> too many concurrent connections. Please ask the administrator to check the 
> number of active connections, and adjust 
> hive.server2.thrift.max.worker.threads if applicable.
> Error: org.apache.thrift.transport.TTransportException (state=08S01,code=0)
> {noformat}
> {noformat}
> org.apache.thrift.protocol.TProtocolException: Missing version in 
> readMessageBegin, old client?
>   at 
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:234)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
>   at java.base/java.lang.Thread.run(Thread.java:832)
> {noformat}
> We can upgrade hive-service-rpc to 3.1.2 to fix this issue.






[jira] [Assigned] (SPARK-33525) Upgrade hive-service-rpc to 3.1.2

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33525:


Assignee: (was: Apache Spark)

> Upgrade hive-service-rpc to 3.1.2
> -
>
> Key: SPARK-33525
> URL: https://issues.apache.org/jira/browse/SPARK-33525
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> We support Hive metastore versions 0.12.0 through 3.1.2, but we only support 
> hive-jdbc versions 0.12.0 through 2.3.7. It throws TProtocolException if we 
> use hive-jdbc 3.x:
> {noformat}
> [root@spark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u 
> jdbc:hive2://localhost:1/default
> Connecting to jdbc:hive2://localhost:1/default
> Connected to: Spark SQL (version 3.1.0-SNAPSHOT)
> Driver: Hive JDBC (version 3.1.2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 3.1.2 by Apache Hive
> 0: jdbc:hive2://localhost:1/default> create table t1(id int) using 
> parquet;
> Unexpected end of file when reading from HS2 server. The root cause might be 
> too many concurrent connections. Please ask the administrator to check the 
> number of active connections, and adjust 
> hive.server2.thrift.max.worker.threads if applicable.
> Error: org.apache.thrift.transport.TTransportException (state=08S01,code=0)
> {noformat}
> {noformat}
> org.apache.thrift.protocol.TProtocolException: Missing version in 
> readMessageBegin, old client?
>   at 
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:234)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
>   at java.base/java.lang.Thread.run(Thread.java:832)
> {noformat}
> We can upgrade hive-service-rpc to 3.1.2 to fix this issue.






[jira] [Commented] (SPARK-33525) Upgrade hive-service-rpc to 3.1.2

2020-11-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237822#comment-17237822
 ] 

Apache Spark commented on SPARK-33525:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/30478

> Upgrade hive-service-rpc to 3.1.2
> -
>
> Key: SPARK-33525
> URL: https://issues.apache.org/jira/browse/SPARK-33525
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> We support Hive metastore versions 0.12.0 through 3.1.2, but we only support 
> hive-jdbc versions 0.12.0 through 2.3.7. It throws TProtocolException if we 
> use hive-jdbc 3.x:
> {noformat}
> [root@spark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u 
> jdbc:hive2://localhost:1/default
> Connecting to jdbc:hive2://localhost:1/default
> Connected to: Spark SQL (version 3.1.0-SNAPSHOT)
> Driver: Hive JDBC (version 3.1.2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 3.1.2 by Apache Hive
> 0: jdbc:hive2://localhost:1/default> create table t1(id int) using 
> parquet;
> Unexpected end of file when reading from HS2 server. The root cause might be 
> too many concurrent connections. Please ask the administrator to check the 
> number of active connections, and adjust 
> hive.server2.thrift.max.worker.threads if applicable.
> Error: org.apache.thrift.transport.TTransportException (state=08S01,code=0)
> {noformat}
> {noformat}
> org.apache.thrift.protocol.TProtocolException: Missing version in 
> readMessageBegin, old client?
>   at 
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:234)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
>   at java.base/java.lang.Thread.run(Thread.java:832)
> {noformat}
> We can upgrade hive-service-rpc to 3.1.2 to fix this issue.






[jira] [Commented] (SPARK-33525) Upgrade hive-service-rpc to 3.1.2

2020-11-23 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237815#comment-17237815
 ] 

Yuming Wang commented on SPARK-33525:
-

We should handle CLI_ODBC_KEYWORDS in SparkSQLCLIService to work around this 
issue: 
{noformat}
20/11/23 20:03:09 WARN ThriftCLIService: Error getting info:
org.apache.hive.service.cli.HiveSQLException: Unrecognized GetInfoType value: 
CLI_ODBC_KEYWORDS
at 
org.apache.hive.service.cli.session.HiveSessionImpl.getInfo(HiveSessionImpl.java:444)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
at 
org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
at 
java.base/java.security.AccessController.doPrivileged(AccessController.java:691)
at java.base/javax.security.auth.Subject.doAs(Subject.java:425)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
at com.sun.proxy.$Proxy23.getInfo(Unknown Source)
at org.apache.hive.service.cli.CLIService.getInfo(CLIService.java:250)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIService.getInfo(SparkSQLCLIService.scala:107)
at 
org.apache.hive.service.cli.thrift.ThriftCLIService.GetInfo(ThriftCLIService.java:440)
at 
org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetInfo.getResult(TCLIService.java:1537)
at 
org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetInfo.getResult(TCLIService.java:1522)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at 
org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.base/java.lang.Thread.run(Thread.java:832)
{noformat}


> Upgrade hive-service-rpc to 3.1.2
> -
>
> Key: SPARK-33525
> URL: https://issues.apache.org/jira/browse/SPARK-33525
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> We support Hive metastore versions 0.12.0 through 3.1.2, but we only support 
> hive-jdbc versions 0.12.0 through 2.3.7. It throws TProtocolException if we 
> use hive-jdbc 3.x:
> {noformat}
> [root@spark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u 
> jdbc:hive2://localhost:1/default
> Connecting to jdbc:hive2://localhost:1/default
> Connected to: Spark SQL (version 3.1.0-SNAPSHOT)
> Driver: Hive JDBC (version 3.1.2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 3.1.2 by Apache Hive
> 0: jdbc:hive2://localhost:1/default> create table t1(id int) using 
> parquet;
> Unexpected end of file when reading from HS2 server. The root cause might be 
> too many concurrent connections. Please ask the administrator to check the 
> number of active connections, and adjust 
> hive.server2.thrift.max.worker.threads if applicable.
> Error: org.apache.thrift.transport.TTransportException (state=08S01,code=0)
> {noformat}
> {noformat}
> org.apache.thrift.protocol.TProtocolException: Missing version in 
> readMessageBegin, old client?
>   at 
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:234)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
>   at java.base/java.lang.Thread.run(Thread.java:832)
> {noformat}
> We can upgrade hive-service-rpc to 3.1.2 to fix this issue.




[jira] [Created] (SPARK-33525) Upgrade hive-service-rpc to 3.1.2

2020-11-23 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33525:
---

 Summary: Upgrade hive-service-rpc to 3.1.2
 Key: SPARK-33525
 URL: https://issues.apache.org/jira/browse/SPARK-33525
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang


We support Hive metastore versions 0.12.0 through 3.1.2, but we only support 
hive-jdbc versions 0.12.0 through 2.3.7. It throws TProtocolException if we use 
hive-jdbc 3.x:

{noformat}
[root@spark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u 
jdbc:hive2://localhost:1/default
Connecting to jdbc:hive2://localhost:1/default
Connected to: Spark SQL (version 3.1.0-SNAPSHOT)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.2 by Apache Hive
0: jdbc:hive2://localhost:1/default> create table t1(id int) using parquet;
Unexpected end of file when reading from HS2 server. The root cause might be 
too many concurrent connections. Please ask the administrator to check the 
number of active connections, and adjust hive.server2.thrift.max.worker.threads 
if applicable.
Error: org.apache.thrift.transport.TTransportException (state=08S01,code=0)
{noformat}

{noformat}
org.apache.thrift.protocol.TProtocolException: Missing version in 
readMessageBegin, old client?
at 
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:234)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
at 
org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.base/java.lang.Thread.run(Thread.java:832)
{noformat}

We can upgrade hive-service-rpc to 3.1.2 to fix this issue.
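
For illustration only, a minimal sbt-style sketch of the version bump (assuming the
artifact keeps the org.apache.hive:hive-service-rpc coordinates; the real change would
be made in the Maven poms and dependency manifests):

{code:scala}
// Hypothetical build snippet, not the actual Spark build change.
// Bump hive-service-rpc from the 2.3.7 line to 3.1.2 so the Thrift protocol
// matches what a Hive 3.x JDBC client speaks.
libraryDependencies += "org.apache.hive" % "hive-service-rpc" % "3.1.2"
{code}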




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33524) Change `InMemoryTable` not to use Tuple.hashCode for `BucketTransform`

2020-11-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33524:
--
Summary: Change `InMemoryTable` not to use Tuple.hashCode for 
`BucketTransform`  (was: Change `BucketTransform` not to use Tuple.hashCode)

> Change `InMemoryTable` not to use Tuple.hashCode for `BucketTransform`
> --
>
> Key: SPARK-33524
> URL: https://issues.apache.org/jira/browse/SPARK-33524
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> `sql/core` module seems to be broken. I will file a new Jira.
> {code:java}
> $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13
> ...
> [info] - SPARK-31255: Project a metadata column *** FAILED *** (96 
> milliseconds)
> [info] - SPARK-31255: Projects data column when metadata column has the same 
> name *** FAILED *** (77 milliseconds){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-33516) Upgrade Scala 2.13 from 2.13.3 to 2.13.4

2020-11-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-33516.
-

> Upgrade Scala 2.13 from 2.13.3 to 2.13.4
> 
>
> Key: SPARK-33516
> URL: https://issues.apache.org/jira/browse/SPARK-33516
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
> Scala 2.13.4 has been released (https://github.com/scala/scala/releases/tag/v2.13.4)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33516) Upgrade Scala 2.13 from 2.13.3 to 2.13.4

2020-11-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33516.
---
Resolution: Duplicate

> Upgrade Scala 2.13 from 2.13.3 to 2.13.4
> 
>
> Key: SPARK-33516
> URL: https://issues.apache.org/jira/browse/SPARK-33516
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
> Scala 2.13.4 has been released (https://github.com/scala/scala/releases/tag/v2.13.4)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33513) Upgrade to Scala 2.13.4

2020-11-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33513:
--
Parent: SPARK-25075
Issue Type: Sub-task  (was: Improvement)

> Upgrade to Scala 2.13.4
> ---
>
> Key: SPARK-33513
> URL: https://issues.apache.org/jira/browse/SPARK-33513
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33524) Change `BucketTransform` not to use Tuple.hashCode

2020-11-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33524:
--
Component/s: Tests

> Change `BucketTransform` not to use Tuple.hashCode
> --
>
> Key: SPARK-33524
> URL: https://issues.apache.org/jira/browse/SPARK-33524
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> `sql/core` module seems to be broken. I will file a new Jira.
> {code:java}
> $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13
> ...
> [info] - SPARK-31255: Project a metadata column *** FAILED *** (96 
> milliseconds)
> [info] - SPARK-31255: Projects data column when metadata column has the same 
> name *** FAILED *** (77 milliseconds){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33524) Change `BucketTransform` not to use Tuple.hashCode

2020-11-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33524:
--
Affects Version/s: 3.0.1

> Change `BucketTransform` not to use Tuple.hashCode
> --
>
> Key: SPARK-33524
> URL: https://issues.apache.org/jira/browse/SPARK-33524
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> `sql/core` module seems to be broken. I will file a new Jira.
> {code:java}
> $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13
> ...
> [info] - SPARK-31255: Project a metadata column *** FAILED *** (96 
> milliseconds)
> [info] - SPARK-31255: Projects data column when metadata column has the same 
> name *** FAILED *** (77 milliseconds){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33524) Change `BucketTransform` not to use Tuple.hashCode

2020-11-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33524:
--
Summary: Change `BucketTransform` not to use Tuple.hashCode  (was: Fix 
DataSourceV2SQLSuite in Scala 2.13)

> Change `BucketTransform` not to use Tuple.hashCode
> --
>
> Key: SPARK-33524
> URL: https://issues.apache.org/jira/browse/SPARK-33524
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> `sql/core` module seems to be broken. I will file a new Jira.
> {code:java}
> $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13
> ...
> [info] - SPARK-31255: Project a metadata column *** FAILED *** (96 
> milliseconds)
> [info] - SPARK-31255: Projects data column when metadata column has the same 
> name *** FAILED *** (77 milliseconds){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33513) Upgrade to Scala 2.13.4

2020-11-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33513:
-

Assignee: Dongjoon Hyun

> Upgrade to Scala 2.13.4
> ---
>
> Key: SPARK-33513
> URL: https://issues.apache.org/jira/browse/SPARK-33513
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33513) Upgrade to Scala 2.13.4

2020-11-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33513.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30455
[https://github.com/apache/spark/pull/30455]

> Upgrade to Scala 2.13.4
> ---
>
> Key: SPARK-33513
> URL: https://issues.apache.org/jira/browse/SPARK-33513
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33524) Fix DataSourceV2SQLSuite in Scala 2.13

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33524:


Assignee: (was: Apache Spark)

> Fix DataSourceV2SQLSuite in Scala 2.13
> --
>
> Key: SPARK-33524
> URL: https://issues.apache.org/jira/browse/SPARK-33524
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> `sql/core` module seems to be broken. I will file a new Jira.
> {code:java}
> $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13
> ...
> [info] - SPARK-31255: Project a metadata column *** FAILED *** (96 
> milliseconds)
> [info] - SPARK-31255: Projects data column when metadata column has the same 
> name *** FAILED *** (77 milliseconds){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33524) Fix DataSourceV2SQLSuite in Scala 2.13

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33524:


Assignee: Apache Spark

> Fix DataSourceV2SQLSuite in Scala 2.13
> --
>
> Key: SPARK-33524
> URL: https://issues.apache.org/jira/browse/SPARK-33524
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> `sql/core` module seems to be broken. I will file a new Jira.
> {code:java}
> $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13
> ...
> [info] - SPARK-31255: Project a metadata column *** FAILED *** (96 
> milliseconds)
> [info] - SPARK-31255: Projects data column when metadata column has the same 
> name *** FAILED *** (77 milliseconds){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33524) Fix DataSourceV2SQLSuite in Scala 2.13

2020-11-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237761#comment-17237761
 ] 

Apache Spark commented on SPARK-33524:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30477

> Fix DataSourceV2SQLSuite in Scala 2.13
> --
>
> Key: SPARK-33524
> URL: https://issues.apache.org/jira/browse/SPARK-33524
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> `sql/core` module seems to be broken. I will file a new Jira.
> {code:java}
> $ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13
> ...
> [info] - SPARK-31255: Project a metadata column *** FAILED *** (96 
> milliseconds)
> [info] - SPARK-31255: Projects data column when metadata column has the same 
> name *** FAILED *** (77 milliseconds){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33523) Add predicate related benchmark to SubExprEliminationBenchmark

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33523:


Assignee: L. C. Hsieh  (was: Apache Spark)

> Add predicate related benchmark to SubExprEliminationBenchmark
> --
>
> Key: SPARK-33523
> URL: https://issues.apache.org/jira/browse/SPARK-33523
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> This is for the task to add predicate related benchmark to 
> SubExprEliminationBenchmark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33523) Add predicate related benchmark to SubExprEliminationBenchmark

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33523:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Add predicate related benchmark to SubExprEliminationBenchmark
> --
>
> Key: SPARK-33523
> URL: https://issues.apache.org/jira/browse/SPARK-33523
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> This is for the task to add predicate related benchmark to 
> SubExprEliminationBenchmark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33523) Add predicate related benchmark to SubExprEliminationBenchmark

2020-11-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237759#comment-17237759
 ] 

Apache Spark commented on SPARK-33523:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/30476

> Add predicate related benchmark to SubExprEliminationBenchmark
> --
>
> Key: SPARK-33523
> URL: https://issues.apache.org/jira/browse/SPARK-33523
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> This is for the task to add predicate related benchmark to 
> SubExprEliminationBenchmark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32808) Pass all `sql/core` module UTs in Scala 2.13

2020-11-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237741#comment-17237741
 ] 

Dongjoon Hyun commented on SPARK-32808:
---

I filed SPARK-33524 .

> Pass all `sql/core` module UTs in Scala 2.13
> 
>
> Key: SPARK-32808
> URL: https://issues.apache.org/jira/browse/SPARK-32808
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.1.0
>
>
> Now there are  319 TESTS FAILED based on commit 
> `f5360e761ef161f7e04526b59a4baf53f1cf8cd5`
> {code:java}
> Run completed in 1 hour, 20 minutes, 25 seconds.
> Total number of tests run: 8485
> Suites: completed 357, aborted 0
> Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0
> *** 319 TESTS FAILED ***
> {code}
>  
> There are 293 failures associated with TPCDS_XXX_PlanStabilitySuite and 
> TPCDS_XXX_PlanStabilityWithStatsSuite:
>  * TPCDSV2_7_PlanStabilitySuite(33 FAILED)
>  * TPCDSV1_4_PlanStabilityWithStatsSuite(94 FAILED)
>  * TPCDSModifiedPlanStabilityWithStatsSuite(21 FAILED)
>  * TPCDSV1_4_PlanStabilitySuite(92 FAILED)
>  * TPCDSModifiedPlanStabilitySuite(21 FAILED)
>  * TPCDSV2_7_PlanStabilityWithStatsSuite(32 FAILED)
>  
> The other 26 FAILED cases are as follows:
>  * StreamingAggregationSuite
>  ** count distinct - state format version 1 
>  ** count distinct - state format version 2 
>  * GeneratorFunctionSuite
>  ** explode and other columns
>  ** explode_outer and other columns
>  * UDFSuite
>  ** SPARK-26308: udf with complex types of decimal
>  ** SPARK-32459: UDF should not fail on WrappedArray
>  * SQLQueryTestSuite
>  ** decimalArithmeticOperations.sql
>  ** postgreSQL/aggregates_part2.sql
>  ** ansi/decimalArithmeticOperations.sql 
>  ** udf/postgreSQL/udf-aggregates_part2.sql - Scala UDF
>  ** udf/postgreSQL/udf-aggregates_part2.sql - Regular Python UDF
>  * WholeStageCodegenSuite
>  ** SPARK-26680: Stream in groupBy does not cause StackOverflowError
>  * DataFrameSuite:
>  ** explode
>  ** SPARK-28067: Aggregate sum should not return wrong results for decimal 
> overflow
>  ** Star Expansion - ds.explode should fail with a meaningful message if it 
> takes a star
>  * DataStreamReaderWriterSuite
>  ** SPARK-18510: use user specified types for partition columns in file 
> sources
>  * OrcV1QuerySuite\OrcV2QuerySuite
>  ** Simple selection form ORC table * 2
>  * ExpressionsSchemaSuite
>  ** Check schemas for expression examples
>  * DataFrameStatSuite
>  ** SPARK-28818: Respect original column nullability in `freqItems`
>  * JsonV1Suite\JsonV2Suite\JsonLegacyTimeParserSuite
>  ** SPARK-4228 DataFrame to JSON * 3
>  ** backward compatibility * 3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33524) Fix DataSourceV2SQLSuite in Scala 2.13

2020-11-23 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-33524:
-

 Summary: Fix DataSourceV2SQLSuite in Scala 2.13
 Key: SPARK-33524
 URL: https://issues.apache.org/jira/browse/SPARK-33524
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun


`sql/core` module seems to be broken. I will file a new Jira.
{code:java}
$ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13
...
[info] - SPARK-31255: Project a metadata column *** FAILED *** (96 milliseconds)
[info] - SPARK-31255: Projects data column when metadata column has the same 
name *** FAILED *** (77 milliseconds){code}
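
As a side note, a small sketch (hypothetical names, not the actual InMemoryTable code) of
why deriving bucket ids from Tuple.hashCode is fragile: the result depends on the standard
library's tuple hashing, which can change between Scala releases, while hashing the
component values explicitly keeps the bucket assignment stable.

{code:scala}
object BucketHashSketch {
  // Fragile: ties the bucket id to however the Scala library hashes tuples,
  // so the assignment can shift when the Scala version changes.
  def bucketFromTupleHash(key: (String, Int), numBuckets: Int): Int =
    ((key.hashCode % numBuckets) + numBuckets) % numBuckets

  // More stable: combine the component hashes explicitly so the result only
  // depends on the values themselves.
  def bucketFromComponents(name: String, id: Int, numBuckets: Int): Int = {
    val combined = 31 * name.hashCode + id
    ((combined % numBuckets) + numBuckets) % numBuckets
  }

  def main(args: Array[String]): Unit = {
    println(bucketFromTupleHash(("a", 1), 4))
    println(bucketFromComponents("a", 1, 4))
  }
}
{code}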



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32808) Pass all `sql/core` module UTs in Scala 2.13

2020-11-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237740#comment-17237740
 ] 

Dongjoon Hyun commented on SPARK-32808:
---

`sql/core` module seems to be broken. I will file a new Jira.
{code:java}
$ build/sbt "sql/testOnly *.DataSourceV2SQLSuite" -Pscala-2.13
...
[info] - SPARK-31255: Project a metadata column *** FAILED *** (96 milliseconds)
[info] - SPARK-31255: Projects data column when metadata column has the same 
name *** FAILED *** (77 milliseconds){code}

> Pass all `sql/core` module UTs in Scala 2.13
> 
>
> Key: SPARK-32808
> URL: https://issues.apache.org/jira/browse/SPARK-32808
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.1.0
>
>
> Now there are  319 TESTS FAILED based on commit 
> `f5360e761ef161f7e04526b59a4baf53f1cf8cd5`
> {code:java}
> Run completed in 1 hour, 20 minutes, 25 seconds.
> Total number of tests run: 8485
> Suites: completed 357, aborted 0
> Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0
> *** 319 TESTS FAILED ***
> {code}
>  
> There are 293 failures associated with TPCDS_XXX_PlanStabilitySuite and 
> TPCDS_XXX_PlanStabilityWithStatsSuite:
>  * TPCDSV2_7_PlanStabilitySuite(33 FAILED)
>  * TPCDSV1_4_PlanStabilityWithStatsSuite(94 FAILED)
>  * TPCDSModifiedPlanStabilityWithStatsSuite(21 FAILED)
>  * TPCDSV1_4_PlanStabilitySuite(92 FAILED)
>  * TPCDSModifiedPlanStabilitySuite(21 FAILED)
>  * TPCDSV2_7_PlanStabilityWithStatsSuite(32 FAILED)
>  
> The other 26 FAILED cases are as follows:
>  * StreamingAggregationSuite
>  ** count distinct - state format version 1 
>  ** count distinct - state format version 2 
>  * GeneratorFunctionSuite
>  ** explode and other columns
>  ** explode_outer and other columns
>  * UDFSuite
>  ** SPARK-26308: udf with complex types of decimal
>  ** SPARK-32459: UDF should not fail on WrappedArray
>  * SQLQueryTestSuite
>  ** decimalArithmeticOperations.sql
>  ** postgreSQL/aggregates_part2.sql
>  ** ansi/decimalArithmeticOperations.sql 
>  ** udf/postgreSQL/udf-aggregates_part2.sql - Scala UDF
>  ** udf/postgreSQL/udf-aggregates_part2.sql - Regular Python UDF
>  * WholeStageCodegenSuite
>  ** SPARK-26680: Stream in groupBy does not cause StackOverflowError
>  * DataFrameSuite:
>  ** explode
>  ** SPARK-28067: Aggregate sum should not return wrong results for decimal 
> overflow
>  ** Star Expansion - ds.explode should fail with a meaningful message if it 
> takes a star
>  * DataStreamReaderWriterSuite
>  ** SPARK-18510: use user specified types for partition columns in file 
> sources
>  * OrcV1QuerySuite\OrcV2QuerySuite
>  ** Simple selection form ORC table * 2
>  * ExpressionsSchemaSuite
>  ** Check schemas for expression examples
>  * DataFrameStatSuite
>  ** SPARK-28818: Respect original column nullability in `freqItems`
>  * JsonV1Suite\JsonV2Suite\JsonLegacyTimeParserSuite
>  ** SPARK-4228 DataFrame to JSON * 3
>  ** backward compatibility * 3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33523) Add predicate related benchmark to SubExprEliminationBenchmark

2020-11-23 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-33523:
---

 Summary: Add predicate related benchmark to 
SubExprEliminationBenchmark
 Key: SPARK-33523
 URL: https://issues.apache.org/jira/browse/SPARK-33523
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.1.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


This is for the task to add predicate related benchmark to 
SubExprEliminationBenchmark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33522) Improve exception messages while handling UnresolvedTableOrView

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33522:


Assignee: (was: Apache Spark)

> Improve exception messages while handling UnresolvedTableOrView
> ---
>
> Key: SPARK-33522
> URL: https://issues.apache.org/jira/browse/SPARK-33522
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Minor
>
> Improve exception messages while handling UnresolvedTableOrView.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33522) Improve exception messages while handling UnresolvedTableOrView

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33522:


Assignee: Apache Spark

> Improve exception messages while handling UnresolvedTableOrView
> ---
>
> Key: SPARK-33522
> URL: https://issues.apache.org/jira/browse/SPARK-33522
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Minor
>
> Improve exception messages while handling UnresolvedTableOrView.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33522) Improve exception messages while handling UnresolvedTableOrView

2020-11-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237709#comment-17237709
 ] 

Apache Spark commented on SPARK-33522:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/30475

> Improve exception messages while handling UnresolvedTableOrView
> ---
>
> Key: SPARK-33522
> URL: https://issues.apache.org/jira/browse/SPARK-33522
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Minor
>
> Improve exception messages while handling UnresolvedTableOrView.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33522) Improve exception messages while handling UnresolvedTableOrView

2020-11-23 Thread Terry Kim (Jira)
Terry Kim created SPARK-33522:
-

 Summary: Improve exception messages while handling 
UnresolvedTableOrView
 Key: SPARK-33522
 URL: https://issues.apache.org/jira/browse/SPARK-33522
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Terry Kim


Improve exception messages while handling UnresolvedTableOrView.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32918) RPC implementation to support control plane coordination for push-based shuffle

2020-11-23 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-32918:
---

Assignee: Ye Zhou

> RPC implementation to support control plane coordination for push-based 
> shuffle
> ---
>
> Key: SPARK-32918
> URL: https://issues.apache.org/jira/browse/SPARK-32918
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Ye Zhou
>Priority: Major
> Fix For: 3.1.0
>
>
> RPCs to facilitate coordination of shuffle map/reduce stages. Notifications
> to external shuffle services to finalize the shuffle block merge for a given
> shuffle are carried through this RPC. It also responds with the metadata about
> a merged shuffle partition to the caller.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32918) RPC implementation to support control plane coordination for push-based shuffle

2020-11-23 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-32918.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30163
[https://github.com/apache/spark/pull/30163]

> RPC implementation to support control plane coordination for push-based 
> shuffle
> ---
>
> Key: SPARK-32918
> URL: https://issues.apache.org/jira/browse/SPARK-32918
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
> Fix For: 3.1.0
>
>
> RPCs to facilitate coordination of shuffle map/reduce stages. Notifications
> to external shuffle services to finalize the shuffle block merge for a given
> shuffle are carried through this RPC. It also responds with the metadata about
> a merged shuffle partition to the caller.
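
As a rough illustration of the exchange described above (the message shapes and field
names below are hypothetical, not the classes added by the pull request): the caller asks
an external shuffle service to finalize the block merge for a shuffle, and the service
responds with metadata about the merged partitions.

{code:scala}
// Hypothetical message shapes; purely illustrative.
case class FinalizeShuffleMerge(appId: String, shuffleId: Int)
case class MergedPartitionMeta(reduceId: Int, sizeInBytes: Long)
case class MergeStatuses(shuffleId: Int, partitions: Seq[MergedPartitionMeta])

object PushShuffleControlPlaneSketch {
  // `send` stands in for the RPC call to one external shuffle service.
  def finalizeMerge(send: FinalizeShuffleMerge => MergeStatuses,
                    appId: String,
                    shuffleId: Int): MergeStatuses =
    send(FinalizeShuffleMerge(appId, shuffleId)) // merged-partition metadata comes back

  def main(args: Array[String]): Unit = {
    val fakeService = (req: FinalizeShuffleMerge) =>
      MergeStatuses(req.shuffleId, Seq(MergedPartitionMeta(reduceId = 0, sizeInBytes = 1024L)))
    println(finalizeMerge(fakeService, "app-1", shuffleId = 7))
  }
}
{code}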



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33521) Universal type conversion of V2 partition values

2020-11-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237678#comment-17237678
 ] 

Apache Spark commented on SPARK-33521:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30474

> Universal type conversion of V2 partition values
> 
>
> Key: SPARK-33521
> URL: https://issues.apache.org/jira/browse/SPARK-33521
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Support other types while resolving partition specs in
> https://github.com/apache/spark/blob/23e9920b3910e4f05269853429c7f1cdc7b5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolvePartitionSpec.scala#L72



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33521) Universal type conversion of V2 partition values

2020-11-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237677#comment-17237677
 ] 

Apache Spark commented on SPARK-33521:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30474

> Universal type conversion of V2 partition values
> 
>
> Key: SPARK-33521
> URL: https://issues.apache.org/jira/browse/SPARK-33521
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Support other types while resolving partition specs in
> https://github.com/apache/spark/blob/23e9920b3910e4f05269853429c7f1cdc7b5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolvePartitionSpec.scala#L72



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33521) Universal type conversion of V2 partition values

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33521:


Assignee: Apache Spark

> Universal type conversion of V2 partition values
> 
>
> Key: SPARK-33521
> URL: https://issues.apache.org/jira/browse/SPARK-33521
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Support other types while resolving partition specs in
> https://github.com/apache/spark/blob/23e9920b3910e4f05269853429c7f1cdc7b5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolvePartitionSpec.scala#L72



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33521) Universal type conversion of V2 partition values

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33521:


Assignee: (was: Apache Spark)

> Universal type conversion of V2 partition values
> 
>
> Key: SPARK-33521
> URL: https://issues.apache.org/jira/browse/SPARK-33521
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Support other types while resolving partition specs in
> https://github.com/apache/spark/blob/23e9920b3910e4f05269853429c7f1cdc7b5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolvePartitionSpec.scala#L72



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33521) Universal type conversion of V2 partition values

2020-11-23 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33521:
--

 Summary: Universal type conversion of V2 partition values
 Key: SPARK-33521
 URL: https://issues.apache.org/jira/browse/SPARK-33521
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Support other types while resolving partition specs in

https://github.com/apache/spark/blob/23e9920b3910e4f05269853429c7f1cdc7b5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolvePartitionSpec.scala#L72



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33519) Batch UDF in scala

2020-11-23 Thread Gaetan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaetan updated SPARK-33519:
---
Description: 
Hello,

Contrary to Python, there is only one type of Scala UDF, which lets us define a
Scala function to apply to a set of Columns and which is called +for each row+.
One advantage of a Scala UDF over mapPartitions is that Catalyst is able to see
what the inputs are, which is then used for column pruning, predicate pushdown
and other optimization rules. But in some use cases there can be a setup phase
that we only want to execute once per worker, right before processing inputs.
For such use cases a Scala UDF is not well suited and mapPartitions is used
instead, like this:

 
{code:java}
ds.mapPartitions(
  it => {
setup()
process(it)
  }
){code}
After having looked at the code, I figured out that Python UDFs are implemented via
query plans that retrieve an RDD via their children and call mapPartitions
on that RDD to work with batches of inputs. These query plans are generated by
Catalyst by extracting Python UDFs (rule ExtractPythonUDFs).

-

*Implementation details*:

1. We could implement a new Expression ScalaBatchUDF and add a boolean isBatch
to Expression that tells whether an Expression is batch or not. The SparkPlans
SelectExec, FilterExec (and probably more) would be modified to handle batch
Expressions:
 * Generated code would include code that calls a batch Expression with a batch of
inputs instead of one single input.
 * The doExecute() method would call a batch Expression with a batch of inputs
instead of one single input.

A SparkPlan could be composed of "single" Expressions and batch expressions. It 
is a first idea that would need to be refined.

2. Another solution could be to do as for Python UDFs: a batch UDF,
implemented as an Expression, is extracted from the query plan it belongs to and
transformed into a query plan ScalaBatchUDFExec (which becomes a child of the
query plan that the batch UDF belongs to).

What do you think?

-

Here is a very small description of *one of our use cases of Spark* that could 
greatly benefit from Scala batch UDFs:

We are using Spark to distribute some computation run in C#. To do so, we call 
the method mapPartitions of the DataFrame that represents our data. Inside 
mapPartitions, we:
 * First connect to the C# process
 * Then iterate over the inputs by sending each input to the C# process and by 
getting back the results.

The use of mapPartitions was motivated by the setup (connection to the C# 
process) that happens for each partition.

Now that we have a first working version, we would like to improve it by
limiting the columns to read. We don't want to select the columns required
by our computation right before the mapPartitions, because that would result in
filtering out columns that could be required by other transformations in the
workflow. Instead, we would like to take advantage of Catalyst for column
pruning, predicate pushdowns and other optimization rules. Using a Scala UDF to
replace the mapPartitions would not be efficient because we would connect to
the C# process for each row. An alternative would be a Scala "batch" UDF which
would be applied on the columns that are needed for our computation, to take
advantage of Catalyst and its optimizing rules, and whose input would be an
iterator, like mapPartitions.

  was:
Hello,

Contrary to Python, there is only one type of Scala UDF, which lets us define a
Scala function to apply to a set of Columns and which is called +for each row+.
One advantage of a Scala UDF over mapPartitions is that Catalyst is able to see
what the inputs are, which is then used for column pruning, predicate pushdown
and other optimization rules. But in some use cases there can be a setup phase
that we only want to execute once per worker, right before processing inputs.
For such use cases a Scala UDF is not well suited and mapPartitions is used
instead, like this:

 
{code:java}
ds.mapPartitions(
  it => {
setup()
process(it)
  }
){code}
After having looked at the code, I figured out that Python UDFs are implemented via
query plans that retrieve an RDD via their children and call mapPartitions
on that RDD to work with batches of inputs. These query plans are generated by
Catalyst by extracting Python UDFs (rule ExtractPythonUDFs).

*Implementation details*: we could implement a new Expression ScalaBatchUDF and
add a boolean isBatch to Expression that tells whether an Expression is batch
or not. The SparkPlans SelectExec, FilterExec (and probably more) would be modified
to handle batch Expressions:
 * Generated code would include code that calls a batch Expression with a batch of
inputs instead of one single input.
 * The doExecute() method would call a batch Expression with a batch of inputs
instead of one single input.

A SparkPlan could be composed of "single" Expressions and batch 

[jira] [Assigned] (SPARK-33430) Support namespaces in JDBC v2 Table Catalog

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33430:


Assignee: (was: Apache Spark)

> Support namespaces in JDBC v2 Table Catalog
> ---
>
> Key: SPARK-33430
> URL: https://issues.apache.org/jira/browse/SPARK-33430
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> When I extend JDBCTableCatalogSuite by 
> org.apache.spark.sql.execution.command.v2.ShowTablesSuite, for instance:
> {code:scala}
> import org.apache.spark.sql.execution.command.v2.ShowTablesSuite
> class JDBCTableCatalogSuite extends ShowTablesSuite {
>   override def version: String = "JDBC V2"
>   override def catalog: String = "h2"
> ...
> {code}
> some tests from JDBCTableCatalogSuite fail with:
> {code}
> [info] - SHOW TABLES JDBC V2: show an existing table *** FAILED *** (2 
> seconds, 502 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: Cannot use catalog h2: does 
> not support namespaces;
> [info]   at 
> org.apache.spark.sql.connector.catalog.CatalogV2Implicits$CatalogHelper.asNamespaceCatalog(CatalogV2Implicits.scala:83)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:208)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:34)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33430) Support namespaces in JDBC v2 Table Catalog

2020-11-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237592#comment-17237592
 ] 

Apache Spark commented on SPARK-33430:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/30473

> Support namespaces in JDBC v2 Table Catalog
> ---
>
> Key: SPARK-33430
> URL: https://issues.apache.org/jira/browse/SPARK-33430
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> When I extend JDBCTableCatalogSuite by 
> org.apache.spark.sql.execution.command.v2.ShowTablesSuite, for instance:
> {code:scala}
> import org.apache.spark.sql.execution.command.v2.ShowTablesSuite
> class JDBCTableCatalogSuite extends ShowTablesSuite {
>   override def version: String = "JDBC V2"
>   override def catalog: String = "h2"
> ...
> {code}
> some tests from JDBCTableCatalogSuite fail with:
> {code}
> [info] - SHOW TABLES JDBC V2: show an existing table *** FAILED *** (2 
> seconds, 502 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: Cannot use catalog h2: does 
> not support namespaces;
> [info]   at 
> org.apache.spark.sql.connector.catalog.CatalogV2Implicits$CatalogHelper.asNamespaceCatalog(CatalogV2Implicits.scala:83)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:208)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:34)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33430) Support namespaces in JDBC v2 Table Catalog

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33430:


Assignee: Apache Spark

> Support namespaces in JDBC v2 Table Catalog
> ---
>
> Key: SPARK-33430
> URL: https://issues.apache.org/jira/browse/SPARK-33430
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> When I extend JDBCTableCatalogSuite by 
> org.apache.spark.sql.execution.command.v2.ShowTablesSuite, for instance:
> {code:scala}
> import org.apache.spark.sql.execution.command.v2.ShowTablesSuite
> class JDBCTableCatalogSuite extends ShowTablesSuite {
>   override def version: String = "JDBC V2"
>   override def catalog: String = "h2"
> ...
> {code}
> some tests from JDBCTableCatalogSuite fail with:
> {code}
> [info] - SHOW TABLES JDBC V2: show an existing table *** FAILED *** (2 
> seconds, 502 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: Cannot use catalog h2: does 
> not support namespaces;
> [info]   at 
> org.apache.spark.sql.connector.catalog.CatalogV2Implicits$CatalogHelper.asNamespaceCatalog(CatalogV2Implicits.scala:83)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:208)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.ResolveCatalogs$$anonfun$apply$1.applyOrElse(ResolveCatalogs.scala:34)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33519) Batch UDF in scala

2020-11-23 Thread Gaetan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaetan updated SPARK-33519:
---
Description: 
Hello,

Contrary to Python, there is only one type of Scala UDF, which lets us define a
Scala function to apply to a set of Columns and which is called +for each row+.
One advantage of a Scala UDF over mapPartitions is that Catalyst is able to see
what the inputs are, which is then used for column pruning, predicate pushdown
and other optimization rules. But in some use cases there can be a setup phase
that we only want to execute once per worker, right before processing inputs.
For such use cases a Scala UDF is not well suited and mapPartitions is used
instead, like this:

 
{code:java}
ds.mapPartitions(
  it => {
setup()
process(it)
  }
){code}
After having looked at the code, I figured out that Python UDFs are implemented via
query plans that retrieve an RDD via their children and call mapPartitions
on that RDD to work with batches of inputs. These query plans are generated by
Catalyst by extracting Python UDFs (rule ExtractPythonUDFs).

*Implementation details*: we could implement a new Expression ScalaBatchUDF and
add a boolean isBatch to Expression that tells whether an Expression is batch
or not. The SparkPlans SelectExec, FilterExec (and probably more) would be modified
to handle batch Expressions:
 * Generated code would include code that calls a batch Expression with a batch of
inputs instead of one single input.
 * The doExecute() method would call a batch Expression with a batch of inputs
instead of one single input.

A SparkPlan could be composed of "single" Expressions and batch expressions. It
is a first idea that would need to be refined. What do you think?

Here is a very small description of *one of our use cases of Spark* that could 
greatly benefit from Scala batch UDFs:

We are using Spark to distribute some computation run in C#. To do so, we call 
the method mapPartitions of the DataFrame that represents our data. Inside 
mapPartitions, we:
 * First connect to the C# process
 * Then iterate over the inputs by sending each input to the C# process and by 
getting back the results.

The use of mapPartitions was motivated by the setup (connection to the C# 
process) that happens for each partition.

Now that we have a first working version, we would like to improve it by
limiting the columns to read. We don't want to select the columns required
by our computation right before the mapPartitions, because that would result in
filtering out columns that could be required by other transformations in the
workflow. Instead, we would like to take advantage of Catalyst for column
pruning, predicate pushdowns and other optimization rules. Using a Scala UDF to
replace the mapPartitions would not be efficient because we would connect to
the C# process for each row. An alternative would be a Scala "batch" UDF which
would be applied on the columns that are needed for our computation, to take
advantage of Catalyst and its optimizing rules, and whose input would be an
iterator, like mapPartitions.

  was:
Hello,

Contrary to Python, there is only one type of Scala UDF, which lets us define a
Scala function to apply to a set of Columns and which is called +for each row+.
One advantage of a Scala UDF over mapPartitions is that Catalyst is able to see
what the inputs are, which is then used for column pruning, predicate pushdown
and other optimization rules. But in some use cases there can be a setup phase
that we only want to execute once per worker, right before processing inputs.
For such use cases a Scala UDF is not well suited and mapPartitions is used
instead, like this:

 
{code:java}
ds.mapPartitions(
  it => {
setup()
process(it)
  }
){code}
After having looked at the code, I figured out that Python UDFs are implemented via
query plans that retrieve an RDD via their children and call mapPartitions
on that RDD to work with batches of inputs. These query plans are generated by
Catalyst by extracting Python UDFs (rule ExtractPythonUDFs).

 

Like for Python UDFs, we could implement Scala batch UDFs with query plans to
work with a batch of inputs instead of one input. What do you think?

Here is a very small description of one of our use cases of Spark that could 
greatly benefit from Scala batch UDFs:

We are using Spark to distribute some computation run in C#. To do so, we call 
the method mapPartitions of the DataFrame that represents our data. Inside 
mapPartitions, we:
 * First connect to the C# process
 * Then iterate over the inputs by sending each input to the C# process and by 
getting back the results.

The use of mapPartitions was motivated by the setup (connection to the C# 
process) that happens for each partition.

Now that we have a first working version, we would like to improve it by 
limiting the columns to read. We don't want to select columns that are required 

[jira] [Updated] (SPARK-33519) Batch UDF in scala

2020-11-23 Thread Gaetan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaetan updated SPARK-33519:
---
Issue Type: New Feature  (was: Wish)

> Batch UDF in scala
> --
>
> Key: SPARK-33519
> URL: https://issues.apache.org/jira/browse/SPARK-33519
> Project: Spark
>  Issue Type: New Feature
>  Components: Optimizer, Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Gaetan
>Priority: Major
>
> Hello,
> Contrary to Python, there is only one type of Scala UDF, which lets us define a
> Scala function to apply to a set of Columns and which is called +for each
> row+. One advantage of a Scala UDF over mapPartitions is that Catalyst is able
> to see what the inputs are, which is then used for column pruning, predicate
> pushdown and other optimization rules. But in some use cases there can be a
> setup phase that we only want to execute once per worker, right before
> processing inputs. For such use cases a Scala UDF is not well suited and
> mapPartitions is used instead, like this:
>  
> {code:java}
> ds.mapPartitions(
>   it => {
> setup()
> process(it)
>   }
> ){code}
> After having looked at the code, I figured out that Python UDFs are implemented
> via query plans that retrieve an RDD via their children and call
> mapPartitions on that RDD to work with batches of inputs. These query plans
> are generated by Catalyst by extracting Python UDFs (rule ExtractPythonUDFs).
>  
> Like for Python UDFs, we could implement Scala batch UDFs with query plans to
> work with a batch of inputs instead of one input. What do you think?
> Here is a very small description of one of our use cases of Spark that could 
> greatly benefit from Scala batch UDFs:
> We are using Spark to distribute some computation run in C#. To do so, we 
> call the method mapPartitions of the DataFrame that represents our data. 
> Inside mapPartitions, we:
>  * First connect to the C# process
>  * Then iterate over the inputs by sending each input to the C# process and 
> by getting back the results.
> The use of mapPartitions was motivated by the setup (connection to the C# 
> process) that happens for each partition.
> Now that we have a first working version, we would like to improve it by
> limiting the columns to read. We don't want to select the columns
> required by our computation right before the mapPartitions, because that would
> result in filtering out columns that could be required by other
> transformations in the workflow. Instead, we would like to take advantage of
> Catalyst for column pruning, predicate pushdowns and other optimization rules.
> Using a Scala UDF to replace the mapPartitions would not be efficient because
> we would connect to the C# process for each row. An alternative would be a
> Scala "batch" UDF which would be applied on the columns that are needed for
> our computation, to take advantage of Catalyst and its optimizing rules, and
> whose input would be an iterator, like mapPartitions.
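
A minimal sketch of the current workaround the description refers to, with hypothetical
column names: the needed columns are selected by hand (Catalyst cannot see inside the
lambda), and the setup cost is paid once per partition inside mapPartitions.

{code:scala}
import org.apache.spark.sql.{Dataset, Row, SparkSession}

object PerPartitionSetupSketch {
  // Stand-ins for the user's setup and per-row processing; purely illustrative.
  def setup(): Unit = ()                                  // e.g. connect to the external process
  def process(it: Iterator[Row]): Iterator[String] = it.map(_.mkString(","))

  def run(spark: SparkSession, ds: Dataset[Row]): Dataset[String] = {
    import spark.implicits._
    // Manual column pruning: only the columns the computation needs are selected,
    // since the optimizer cannot infer them from the opaque lambda below.
    ds.select("colA", "colB").mapPartitions { it =>
      setup()      // once per partition, not once per row
      process(it)
    }
  }
}
{code}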



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33520) make CrossValidator/TrainValidateSplit support Python backend estimator/model

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33520:


Assignee: (was: Apache Spark)

> make CrossValidator/TrainValidateSplit support Python backend estimator/model
> -
>
> Key: SPARK-33520
> URL: https://issues.apache.org/jira/browse/SPARK-33520
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Weichen Xu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33520) make CrossValidator/TrainValidateSplit support Python backend estimator/model

2020-11-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237455#comment-17237455
 ] 

Apache Spark commented on SPARK-33520:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/30471

> make CrossValidator/TrainValidateSplit support Python backend estimator/model
> -
>
> Key: SPARK-33520
> URL: https://issues.apache.org/jira/browse/SPARK-33520
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Weichen Xu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33520) make CrossValidator/TrainValidateSplit support Python backend estimator/model

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33520:


Assignee: Apache Spark

> make CrossValidator/TrainValidateSplit support Python backend estimator/model
> -
>
> Key: SPARK-33520
> URL: https://issues.apache.org/jira/browse/SPARK-33520
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33520) make CrossValidator/TrainValidateSplit support Python backend estimator/model

2020-11-23 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-33520:
--

 Summary: make CrossValidator/TrainValidateSplit support Python 
backend estimator/model
 Key: SPARK-33520
 URL: https://issues.apache.org/jira/browse/SPARK-33520
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Affects Versions: 3.1.0
Reporter: Weichen Xu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32221) Avoid possible errors due to incorrect file size or type supplied in spark conf.

2020-11-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237449#comment-17237449
 ] 

Apache Spark commented on SPARK-32221:
--

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/30472

> Avoid possible errors due to incorrect file size or type supplied in spark 
> conf.
> 
>
> Key: SPARK-32221
> URL: https://issues.apache.org/jira/browse/SPARK-32221
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> This would avoid failures when the files are somewhat large or when a user 
> places a binary file inside SPARK_CONF_DIR; neither is supported at the moment.
> The reason is that the underlying etcd store limits each entry to 1 MiB (recent 
> versions of K8s have moved to etcd 3.4.x, which allows a 1.5 MiB limit, see 
> [https://etcd.io/docs/v3.4.0/dev-guide/limit/]). Once etcd is upgraded in all 
> the popular k8s clusters, we can hope to overcome this limitation.
> Even if that does not happen, there are other ways around it, for example 
> splitting the config files across multiple configMaps. That needs to be 
> discussed and prioritised; this issue takes the straightforward approach of 
> skipping files that cannot be accommodated within the 1.5 MiB limit and 
> WARNING the user about it.
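A minimal sketch of the skip-and-warn behaviour described above; the helper name and the 1.5 MiB constant are assumptions for illustration, not the actual Kubernetes submission code:

{code:scala}
import java.io.File

// Hypothetical helper: keep only the regular files under SPARK_CONF_DIR that fit
// within an assumed per-ConfigMap-entry budget, and warn about the ones skipped.
object ConfDirSizeFilter {
  private val MaxEntryBytes: Long = 3L * 512 * 1024  // ~1.5 MiB, assumed etcd entry limit

  def selectConfFiles(confDir: File): Seq[File] = {
    val files = Option(confDir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
    val (kept, skipped) = files.filter(_.isFile).partition(_.length() <= MaxEntryBytes)
    skipped.foreach { f =>
      System.err.println(s"WARNING: skipping ${f.getName}: ${f.length()} bytes exceeds $MaxEntryBytes")
    }
    kept
  }
}
{code}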



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32221) Avoid possible errors due to incorrect file size or type supplied in spark conf.

2020-11-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237448#comment-17237448
 ] 

Apache Spark commented on SPARK-32221:
--

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/30472

> Avoid possible errors due to incorrect file size or type supplied in spark 
> conf.
> 
>
> Key: SPARK-32221
> URL: https://issues.apache.org/jira/browse/SPARK-32221
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> This would avoid failures when the files are somewhat large or when a user 
> places a binary file inside SPARK_CONF_DIR; neither is supported at the moment.
> The reason is that the underlying etcd store limits each entry to 1 MiB (recent 
> versions of K8s have moved to etcd 3.4.x, which allows a 1.5 MiB limit, see 
> [https://etcd.io/docs/v3.4.0/dev-guide/limit/]). Once etcd is upgraded in all 
> the popular k8s clusters, we can hope to overcome this limitation.
> Even if that does not happen, there are other ways around it, for example 
> splitting the config files across multiple configMaps. That needs to be 
> discussed and prioritised; this issue takes the straightforward approach of 
> skipping files that cannot be accommodated within the 1.5 MiB limit and 
> WARNING the user about it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32221) Avoid possible errors due to incorrect file size or type supplied in spark conf.

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32221:


Assignee: (was: Apache Spark)

> Avoid possible errors due to incorrect file size or type supplied in spark 
> conf.
> 
>
> Key: SPARK-32221
> URL: https://issues.apache.org/jira/browse/SPARK-32221
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> This would avoid failures when the files are somewhat large or when a user 
> places a binary file inside SPARK_CONF_DIR; neither is supported at the moment.
> The reason is that the underlying etcd store limits each entry to 1 MiB (recent 
> versions of K8s have moved to etcd 3.4.x, which allows a 1.5 MiB limit, see 
> [https://etcd.io/docs/v3.4.0/dev-guide/limit/]). Once etcd is upgraded in all 
> the popular k8s clusters, we can hope to overcome this limitation.
> Even if that does not happen, there are other ways around it, for example 
> splitting the config files across multiple configMaps. That needs to be 
> discussed and prioritised; this issue takes the straightforward approach of 
> skipping files that cannot be accommodated within the 1.5 MiB limit and 
> WARNING the user about it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32221) Avoid possible errors due to incorrect file size or type supplied in spark conf.

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32221:


Assignee: Apache Spark

> Avoid possible errors due to incorrect file size or type supplied in spark 
> conf.
> 
>
> Key: SPARK-32221
> URL: https://issues.apache.org/jira/browse/SPARK-32221
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Apache Spark
>Priority: Major
>
> This would avoid failures when the files are somewhat large or when a user 
> places a binary file inside SPARK_CONF_DIR; neither is supported at the moment.
> The reason is that the underlying etcd store limits each entry to 1 MiB (recent 
> versions of K8s have moved to etcd 3.4.x, which allows a 1.5 MiB limit, see 
> [https://etcd.io/docs/v3.4.0/dev-guide/limit/]). Once etcd is upgraded in all 
> the popular k8s clusters, we can hope to overcome this limitation.
> Even if that does not happen, there are other ways around it, for example 
> splitting the config files across multiple configMaps. That needs to be 
> discussed and prioritised; this issue takes the straightforward approach of 
> skipping files that cannot be accommodated within the 1.5 MiB limit and 
> WARNING the user about it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33495) jcl-over-slf4j conflicts with commons-logging.jar

2020-11-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237403#comment-17237403
 ] 

Apache Spark commented on SPARK-33495:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/30470

> jcl-over-slf4j conflicts with commons-logging.jar
> -
>
> Key: SPARK-33495
> URL: https://issues.apache.org/jira/browse/SPARK-33495
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.4.7, 3.0.0, 3.0.1
>Reporter: lrz
>Priority: Minor
>
> Spark introduced jcl-over-slf4j as the bridge between commons-logging and 
> slf4j. See:
> [https://jira.qos.ch/browse/SLF4J-250?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel]
> [http://www.slf4j.org/legacy.html]
> Because jcl-over-slf4j.jar contains classes that duplicate those in 
> commons-logging.jar, it is better to remove the dependency on commons-logging.
> We also found a deadlock caused by the coexistence of jcl-over-slf4j.jar and 
> commons-logging.jar; it happens during the class initialization of LogFactory 
> and SLF4JLogFactory.
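A hedged sketch (sbt syntax) of one way to keep commons-logging.jar off the classpath once jcl-over-slf4j provides the same classes; "com.example" % "some-library" is a placeholder for whichever dependency pulls commons-logging in:

{code:scala}
// Exclude commons-logging from a single (placeholder) dependency:
libraryDependencies += ("com.example" % "some-library" % "1.0")
  .exclude("commons-logging", "commons-logging")

// Or exclude it for the whole build, relying on jcl-over-slf4j instead:
excludeDependencies += ExclusionRule("commons-logging", "commons-logging")
{code}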



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33495) jcl-over-slf4j conflicts with commons-logging.jar

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33495:


Assignee: (was: Apache Spark)

> jcl-over-slf4j conflicts with commons-logging.jar
> -
>
> Key: SPARK-33495
> URL: https://issues.apache.org/jira/browse/SPARK-33495
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.4.7, 3.0.0, 3.0.1
>Reporter: lrz
>Priority: Minor
>
> Spark introduced jcl-over-slf4j as the bridge between commons-logging and 
> slf4j. See:
> [https://jira.qos.ch/browse/SLF4J-250?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel]
> [http://www.slf4j.org/legacy.html]
> Because jcl-over-slf4j.jar contains classes that duplicate those in 
> commons-logging.jar, it is better to remove the dependency on commons-logging.
> We also found a deadlock caused by the coexistence of jcl-over-slf4j.jar and 
> commons-logging.jar; it happens during the class initialization of LogFactory 
> and SLF4JLogFactory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33495) jcl-over-slf4j conflicts with commons-logging.jar

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33495:


Assignee: Apache Spark

> jcl-over-slf4j conflicts with commons-logging.jar
> -
>
> Key: SPARK-33495
> URL: https://issues.apache.org/jira/browse/SPARK-33495
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.4.7, 3.0.0, 3.0.1
>Reporter: lrz
>Assignee: Apache Spark
>Priority: Minor
>
> Spark introduced jcl-over-slf4j as the bridge between commons-logging and 
> slf4j. See:
> [https://jira.qos.ch/browse/SLF4J-250?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel]
> [http://www.slf4j.org/legacy.html]
> Because jcl-over-slf4j.jar contains classes that duplicate those in 
> commons-logging.jar, it is better to remove the dependency on commons-logging.
> We also found a deadlock caused by the coexistence of jcl-over-slf4j.jar and 
> commons-logging.jar; it happens during the class initialization of LogFactory 
> and SLF4JLogFactory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33495) jcl-over-slf4j conflicts with commons-logging.jar

2020-11-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237402#comment-17237402
 ] 

Apache Spark commented on SPARK-33495:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/30470

> jcl-over-slf4j conflicts with commons-logging.jar
> -
>
> Key: SPARK-33495
> URL: https://issues.apache.org/jira/browse/SPARK-33495
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.4.7, 3.0.0, 3.0.1
>Reporter: lrz
>Priority: Minor
>
> Spark introduced jcl-over-slf4j as the bridge between commons-logging and 
> slf4j. See:
> [https://jira.qos.ch/browse/SLF4J-250?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel]
> [http://www.slf4j.org/legacy.html]
> Because jcl-over-slf4j.jar contains classes that duplicate those in 
> commons-logging.jar, it is better to remove the dependency on commons-logging.
> We also found a deadlock caused by the coexistence of jcl-over-slf4j.jar and 
> commons-logging.jar; it happens during the class initialization of LogFactory 
> and SLF4JLogFactory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33519) Batch UDF in scala

2020-11-23 Thread Gaetan (Jira)
Gaetan created SPARK-33519:
--

 Summary: Batch UDF in scala
 Key: SPARK-33519
 URL: https://issues.apache.org/jira/browse/SPARK-33519
 Project: Spark
  Issue Type: Wish
  Components: Optimizer, Spark Core
Affects Versions: 3.0.1, 3.0.0
Reporter: Gaetan


Hello,

Unlike Python, there is only one type of Scala UDF: it lets us define a Scala 
function that is applied to a set of Columns and is called +for each row+. One 
advantage of a Scala UDF over mapPartitions is that Catalyst can see the inputs, 
which are then used for column pruning, predicate pushdown and other 
optimization rules. But in some use cases there is a setup phase that we only 
want to execute once per worker, right before processing the inputs. For such 
use cases a Scala UDF is not well suited, and mapPartitions is used instead, 
like this:

 
{code:java}
ds.mapPartitions(
  it => {
setup()
process(it)
  }
){code}
After looking at the code, I figured out that Python UDFs are implemented via 
query plans that retrieve an RDD from their children and call mapPartitions on 
that RDD to work with batches of inputs. These query plans are generated by 
Catalyst by extracting the Python UDFs (rule ExtractPythonUDFs).

 

As with Python UDFs, we could implement Scala batch UDFs as query plans that 
work with a batch of inputs instead of a single input. What do you think?

Here is a brief description of one of our Spark use cases that could greatly 
benefit from Scala batch UDFs:

We are using Spark to distribute some computation run in C#. To do so, we call 
the mapPartitions method of the DataFrame that represents our data. Inside 
mapPartitions, we:
 * first connect to the C# process;
 * then iterate over the inputs, sending each input to the C# process and 
getting back the results.

The use of mapPartitions was motivated by the setup (connection to the C# 
process) that happens for each partition.

Now that we have a first working version, we would like to improve it by 
limiting the columns that are read. We don't want to select the required columns 
right before the mapPartitions, because that would filter out columns that other 
transformations in the workflow may still need. Instead, we would like to take 
advantage of Catalyst for column pruning, predicate pushdown and other 
optimization rules. Replacing the mapPartitions with a Scala UDF would not be 
efficient, because we would connect to the C# process for each row. An 
alternative would be a Scala "batch" UDF applied to the columns needed by our 
computation, taking advantage of Catalyst and its optimization rules, and whose 
input would be an iterator, as with mapPartitions. 
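For context, a minimal sketch of the pattern being discussed; setup() and process() are placeholders for the connection to the external (e.g. C#) process, and the helper is not a real Spark API. The setup cost is paid once per partition, but Catalyst cannot see which columns process() reads:

{code:scala}
import org.apache.spark.sql.{Dataset, Encoder}

object PartitionSetupSketch {
  // Run a once-per-partition setup, then stream the partition's rows through process().
  def mapWithSetup[T, U: Encoder](ds: Dataset[T])(setup: () => Unit)(
      process: Iterator[T] => Iterator[U]): Dataset[U] =
    ds.mapPartitions { it =>
      setup()      // e.g. connect to the external process once per partition
      process(it)  // iterate over the inputs, as in the use case above
    }
}
{code}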



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32792) Improve in filter pushdown for ParquetFilters

2020-11-23 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32792:

Description: 
Support pushing down the `GreaterThanOrEqual` minimum value and `LessThanOrEqual` 
maximum value when the number of IN values exceeds 
`spark.sql.parquet.pushdown.inFilterThreshold`. For example:

```sql
SELECT * FROM t WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15)
```

We will push down `id >= 1 and id <= 15`.

  was:
[https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L602]
{code:scala}
  case sources.In(name, values) if canMakeFilterOn(name, values.head)
&& values.distinct.length <= pushDownInFilterThreshold =>
values.distinct.flatMap { v =>
{code}

*distinct* is expensive
 


> Improve in filter pushdown for ParquetFilters
> -
>
> Key: SPARK-32792
> URL: https://issues.apache.org/jira/browse/SPARK-32792
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Support pushing down the `GreaterThanOrEqual` minimum value and `LessThanOrEqual` 
> maximum value when the number of IN values exceeds 
> `spark.sql.parquet.pushdown.inFilterThreshold`. For example:
> ```sql
> SELECT * FROM t WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15)
> ```
> We will push down `id >= 1 and id <= 15`.
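A hedged sketch of that rewrite at the data source Filter level (not the actual ParquetFilters code); it assumes numeric values and simply falls back to a [min, max] range once the IN list is longer than the threshold:

{code:scala}
import org.apache.spark.sql.sources.{And, Filter, GreaterThanOrEqual, In, LessThanOrEqual}

def rewriteLargeIn(filter: Filter, threshold: Int): Filter = filter match {
  case In(name, values)
      if values.distinct.length > threshold && values.forall(_.isInstanceOf[Number]) =>
    val nums = values.map(_.asInstanceOf[Number].longValue())
    // Push down a range instead of one equality predicate per value.
    And(GreaterThanOrEqual(name, nums.min), LessThanOrEqual(name, nums.max))
  case other => other
}
{code}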



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33479) Make apiKey of docsearch configurable

2020-11-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237319#comment-17237319
 ] 

Apache Spark commented on SPARK-33479:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/30469

> Make apiKey of docsearch configurable
> -
>
> Key: SPARK-33479
> URL: https://issues.apache.org/jira/browse/SPARK-33479
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.1.0
>
>
> After https://github.com/apache/spark/pull/30292, our Spark documentation 
> site supports searching. 
> However, the default API key always points to the latest release doc. We have 
> to set different API keys for different releases. Otherwise, the search 
> results are always based on the latest 
> documentation (https://spark.apache.org/docs/latest/), even when visiting the
> documentation of previous releases.
> As per discussion in 
> https://github.com/apache/spark/pull/30292#issuecomment-725613417, we should 
> make the API key configurable and avoid hardcoding in the HTML template.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33479) Make apiKey of docsearch configurable

2020-11-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237318#comment-17237318
 ] 

Apache Spark commented on SPARK-33479:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/30469

> Make apiKey of docsearch configurable
> -
>
> Key: SPARK-33479
> URL: https://issues.apache.org/jira/browse/SPARK-33479
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.1.0
>
>
> After https://github.com/apache/spark/pull/30292, our Spark documentation 
> site supports searching. 
> However, the default API key always points to the latest release doc. We have 
> to set different API keys for different releases. Otherwise, the search 
> results are always based on the latest 
> documentation (https://spark.apache.org/docs/latest/), even when visiting the
> documentation of previous releases.
> As per discussion in 
> https://github.com/apache/spark/pull/30292#issuecomment-725613417, we should 
> make the API key configurable and avoid hardcoding in the HTML template.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33518) Improve performance of ML ALS recommendForAll by GEMV

2020-11-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237280#comment-17237280
 ] 

Apache Spark commented on SPARK-33518:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/30468

> Improve performance of ML ALS recommendForAll by GEMV
> -
>
> Key: SPARK-33518
> URL: https://issues.apache.org/jira/browse/SPARK-33518
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Priority: Major
>
> There has been a lot of work on improving ALS's {{recommendForAll}}.
> For now, I found that it may be further optimized by:
> 1, using GEMV;
> 2, directly aggregate on topK collections (srcId, Array(dstId), 
> Array(score)), instead of each element (srcId, (dstId, score));
> 3, use guava.ordering instead of BoundedPriorityQueue;
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33518) Improve performance of ML ALS recommendForAll by GEMV

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33518:


Assignee: (was: Apache Spark)

> Improve performance of ML ALS recommendForAll by GEMV
> -
>
> Key: SPARK-33518
> URL: https://issues.apache.org/jira/browse/SPARK-33518
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Priority: Major
>
> There has been a lot of work on improving ALS's {{recommendForAll}}.
> For now, I found that it may be further optimized by:
> 1, using GEMV;
> 2, directly aggregate on topK collections (srcId, Array(dstId), 
> Array(score)), instead of each element (srcId, (dstId, score));
> 3, use guava.ordering instead of BoundedPriorityQueue;
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33518) Improve performance of ML ALS recommendForAll by GEMV

2020-11-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237279#comment-17237279
 ] 

Apache Spark commented on SPARK-33518:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/30468

> Improve performance of ML ALS recommendForAll by GEMV
> -
>
> Key: SPARK-33518
> URL: https://issues.apache.org/jira/browse/SPARK-33518
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Priority: Major
>
> There has been a lot of work on improving ALS's {{recommendForAll}}.
> For now, I found that it may be further optimized by:
> 1, using GEMV;
> 2, directly aggregate on topK collections (srcId, Array(dstId), 
> Array(score)), instead of each element (srcId, (dstId, score));
> 3, use guava.ordering instead of BoundedPriorityQueue;
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33518) Improve performance of ML ALS recommendForAll by GEMV

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33518:


Assignee: Apache Spark

> Improve performance of ML ALS recommendForAll by GEMV
> -
>
> Key: SPARK-33518
> URL: https://issues.apache.org/jira/browse/SPARK-33518
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>
> There has been a lot of work on improving ALS's {{recommendForAll}}.
> For now, I found that it may be further optimized by:
> 1, using GEMV;
> 2, directly aggregate on topK collections (srcId, Array(dstId), 
> Array(score)), instead of each element (srcId, (dstId, score));
> 3, use guava.ordering instead of BoundedPriorityQueue;
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33518) Improve performance of ML ALS recommendForAll by GEMV

2020-11-23 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-33518:


 Summary: Improve performance of ML ALS recommendForAll by GEMV
 Key: SPARK-33518
 URL: https://issues.apache.org/jira/browse/SPARK-33518
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.2.0
Reporter: zhengruifeng


There has been a lot of work on improving ALS's {{recommendForAll}}.

For now, I found that it may be further optimized by:

1, using GEMV;

2, directly aggregate on topK collections (srcId, Array(dstId), Array(score)), 
instead of each element (srcId, (dstId, score));

3, use guava.ordering instead of BoundedPriorityQueue;
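
A self-contained sketch of idea 1, with a plain dot-product loop standing in for a real BLAS gemv call; the shapes and names are illustrative, not the actual ALSModel internals:

{code:scala}
// Score every item for one user with a matrix-vector product and keep the top k,
// instead of scoring (user, item) pairs one element at a time.
def recommendForUser(
    itemFactors: Array[Array[Float]],  // one row per item
    userVector: Array[Float],
    k: Int): Array[(Int, Float)] = {
  val scores = itemFactors.map { itemVec =>
    var s = 0.0f
    var i = 0
    while (i < itemVec.length) { s += itemVec(i) * userVector(i); i += 1 }
    s
  }
  scores.zipWithIndex.map { case (score, itemId) => (itemId, score) }
    .sortBy(-_._2)
    .take(k)
}
{code}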

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33517) Incorrect menu item display and link in PySpark Usage Guide for Pandas with Apache Arrow

2020-11-23 Thread liucht-inspur (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liucht-inspur updated SPARK-33517:
--
Attachment: image-2020-11-23-18-47-01-591.png

> Incorrect menu item display and link in PySpark Usage Guide for Pandas with 
> Apache Arrow
> 
>
> Key: SPARK-33517
> URL: https://issues.apache.org/jira/browse/SPARK-33517
> Project: Spark
>  Issue Type: Bug
>  Components: docs
>Affects Versions: 3.0.0, 3.0.1
>Reporter: liucht-inspur
>Priority: Minor
> Attachments: image-2020-11-23-18-47-01-591.png, spark-doc.jpg
>
>
> The menu item and its link are set incorrectly; change "Apache Arrow in Spark" 
> to "Apache Arrow in PySpark".
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33517) Incorrect menu item display and link in PySpark Usage Guide for Pandas with Apache Arrow

2020-11-23 Thread liucht-inspur (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liucht-inspur updated SPARK-33517:
--
Description: 
The menu item and its link are set incorrectly; change "Apache Arrow in Spark" 
to "Apache Arrow in PySpark".

  !image-2020-11-23-18-47-01-591.png!

  was:
The menu item and its link are set incorrectly; change "Apache Arrow in Spark" 
to "Apache Arrow in PySpark".

 


> Incorrect menu item display and link in PySpark Usage Guide for Pandas with 
> Apache Arrow
> 
>
> Key: SPARK-33517
> URL: https://issues.apache.org/jira/browse/SPARK-33517
> Project: Spark
>  Issue Type: Bug
>  Components: docs
>Affects Versions: 3.0.0, 3.0.1
>Reporter: liucht-inspur
>Priority: Minor
> Attachments: image-2020-11-23-18-47-01-591.png, spark-doc.jpg
>
>
> The menu item and its link are set incorrectly; change "Apache Arrow in Spark" 
> to "Apache Arrow in PySpark".
>   !image-2020-11-23-18-47-01-591.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33511) Respect case sensitivity in resolving partition specs V2

2020-11-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33511.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30454
[https://github.com/apache/spark/pull/30454]

> Respect case sensitivity in resolving partition specs V2
> 
>
> Key: SPARK-33511
> URL: https://issues.apache.org/jira/browse/SPARK-33511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> DSv1 DDL commands respect the SQL config spark.sql.caseSensitive, for example
> {code:java}
> spark-sql> CREATE TABLE tbl1 (id bigint, data string) USING parquet 
> PARTITIONED BY (id);
> spark-sql> ALTER TABLE tbl1 ADD PARTITION (ID=1);
> spark-sql> SHOW PARTITIONS tbl1;
> id=1
> {code}
> but the same ALTER TABLE command fails on DSv2.
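
A hedged sketch of the general idea behind the fix (not the actual DSv2 code): normalize user-supplied partition column names against the table's partition schema using the configured resolver, so that ID matches id when spark.sql.caseSensitive is false:

{code:scala}
// resolver is (String, String) => Boolean: case-sensitive or case-insensitive equality,
// depending on spark.sql.caseSensitive.
def normalizePartitionSpec(
    spec: Map[String, String],
    partitionColumns: Seq[String],
    resolver: (String, String) => Boolean): Map[String, String] =
  spec.map { case (key, value) =>
    val normalizedKey = partitionColumns
      .find(col => resolver(col, key))
      .getOrElse(throw new IllegalArgumentException(s"$key is not a partition column"))
    normalizedKey -> value
  }
{code}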



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33511) Respect case sensitivity in resolving partition specs V2

2020-11-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33511:
---

Assignee: Maxim Gekk

> Respect case sensitivity in resolving partition specs V2
> 
>
> Key: SPARK-33511
> URL: https://issues.apache.org/jira/browse/SPARK-33511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> DSv1 DDL commands respect the SQL config spark.sql.caseSensitive, for example
> {code:java}
> spark-sql> CREATE TABLE tbl1 (id bigint, data string) USING parquet 
> PARTITIONED BY (id);
> spark-sql> ALTER TABLE tbl1 ADD PARTITION (ID=1);
> spark-sql> SHOW PARTITIONS tbl1;
> id=1
> {code}
> but the same ALTER TABLE command fails on DSv2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33515) Improve exception messages while handling UnresolvedTable

2020-11-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33515.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30461
[https://github.com/apache/spark/pull/30461]

> Improve exception messages while handling UnresolvedTable
> -
>
> Key: SPARK-33515
> URL: https://issues.apache.org/jira/browse/SPARK-33515
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
> Fix For: 3.1.0
>
>
> Improve exception messages while handling UnresolvedTable by adding the command 
> name.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33515) Improve exception messages while handling UnresolvedTable

2020-11-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33515:
---

Assignee: Terry Kim

> Improve exception messages while handling UnresolvedTable
> -
>
> Key: SPARK-33515
> URL: https://issues.apache.org/jira/browse/SPARK-33515
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
>
> Improve exception messages while handling UnresolvedTable by adding the command 
> name.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32481) Support truncate table to move the data to trash

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32481:


Assignee: (was: Apache Spark)

> Support truncate table to move the data to trash
> 
>
> Key: SPARK-32481
> URL: https://issues.apache.org/jira/browse/SPARK-32481
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>
> *Instead of deleting the data, move it to the trash, so that it can later be 
> deleted from the trash permanently, based on configuration.*
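A hedged sketch of the proposed behaviour using Hadoop's Trash API (this is not the actual, since-reverted Spark patch): try to move the table location to the trash, and fall back to a plain recursive delete when the trash is disabled or the move does not happen:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, Trash}

def truncateToTrash(hadoopConf: Configuration, location: Path): Unit = {
  val fs = location.getFileSystem(hadoopConf)
  // Returns false e.g. when fs.trash.interval is 0 (trash disabled).
  val movedToTrash = Trash.moveToAppropriateTrash(fs, location, hadoopConf)
  if (!movedToTrash) {
    fs.delete(location, true)  // recursive delete, the current TRUNCATE behaviour
  }
}
{code}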



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32481) Support truncate table to move the data to trash

2020-11-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32481:


Assignee: Apache Spark

> Support truncate table to move the data to trash
> 
>
> Key: SPARK-32481
> URL: https://issues.apache.org/jira/browse/SPARK-32481
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Assignee: Apache Spark
>Priority: Minor
>
> *Instead of deleting the data, move it to the trash, so that it can later be 
> deleted from the trash permanently, based on configuration.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-32481) Support truncate table to move the data to trash

2020-11-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-32481:
--
  Assignee: (was: Udbhav Agrawal)

> Support truncate table to move the data to trash
> 
>
> Key: SPARK-32481
> URL: https://issues.apache.org/jira/browse/SPARK-32481
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>
> *Instead of deleting the data, move it to the trash, so that it can later be 
> deleted from the trash permanently, based on configuration.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32481) Support truncate table to move the data to trash

2020-11-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32481:
-
Fix Version/s: (was: 3.1.0)

> Support truncate table to move the data to trash
> 
>
> Key: SPARK-32481
> URL: https://issues.apache.org/jira/browse/SPARK-32481
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Assignee: Udbhav Agrawal
>Priority: Minor
>
> *Instead of deleting the data, move it to the trash, so that it can later be 
> deleted from the trash permanently, based on configuration.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32481) Support truncate table to move the data to trash

2020-11-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237224#comment-17237224
 ] 

Hyukjin Kwon commented on SPARK-32481:
--

Reverted in https://github.com/apache/spark/pull/30463

> Support truncate table to move the data to trash
> 
>
> Key: SPARK-32481
> URL: https://issues.apache.org/jira/browse/SPARK-32481
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Assignee: Udbhav Agrawal
>Priority: Minor
> Fix For: 3.1.0
>
>
> *Instead of deleting the data, move it to the trash, so that it can later be 
> deleted from the trash permanently, based on configuration.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


