[jira] [Assigned] (SPARK-34383) Optimize WAL commit phase on SS

2021-02-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34383:


Assignee: (was: Apache Spark)

> Optimize WAL commit phase on SS
> ---
>
> Key: SPARK-34383
> URL: https://issues.apache.org/jira/browse/SPARK-34383
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> I found there are unnecessary accesses to / expensive operations on the file 
> system in the WAL commit phase of SS.
> They can be optimized via caching (using a bit of driver memory) and replacing 
> the FS operations. This reduces latency per batch, especially when 
> checkpointing against an object store.






[jira] [Assigned] (SPARK-34383) Optimize WAL commit phase on SS

2021-02-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34383:


Assignee: Apache Spark

> Optimize WAL commit phase on SS
> ---
>
> Key: SPARK-34383
> URL: https://issues.apache.org/jira/browse/SPARK-34383
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> I found there are unnecessary accesses to / expensive operations on the file 
> system in the WAL commit phase of SS.
> They can be optimized via caching (using a bit of driver memory) and replacing 
> the FS operations. This reduces latency per batch, especially when 
> checkpointing against an object store.






[jira] [Commented] (SPARK-34383) Optimize WAL commit phase on SS

2021-02-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17280075#comment-17280075
 ] 

Apache Spark commented on SPARK-34383:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/31495

> Optimize WAL commit phase on SS
> ---
>
> Key: SPARK-34383
> URL: https://issues.apache.org/jira/browse/SPARK-34383
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> I found there are unnecessary accesses to / expensive operations on the file 
> system in the WAL commit phase of SS.
> They can be optimized via caching (using a bit of driver memory) and replacing 
> the FS operations. This reduces latency per batch, especially when 
> checkpointing against an object store.






[jira] [Created] (SPARK-34383) Optimize WAL commit phase on SS

2021-02-05 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-34383:


 Summary: Optimize WAL commit phase on SS
 Key: SPARK-34383
 URL: https://issues.apache.org/jira/browse/SPARK-34383
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.2.0
Reporter: Jungtaek Lim


I found there are unnecessary accesses to / expensive operations on the file 
system in the WAL commit phase of SS.

They can be optimized via caching (using a bit of driver memory) and replacing 
the FS operations. This reduces latency per batch, especially when checkpointing 
against an object store.
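
A minimal sketch of the caching idea, assuming a hypothetical, simplified 
MetadataLog interface (the real classes live in 
org.apache.spark.sql.execution.streaming and differ in detail): keep the latest 
committed batch id in driver memory so the commit path can skip exists/list 
round trips against the checkpoint file system.

{code:scala}
// Hypothetical, simplified interface; for illustration only.
trait MetadataLog[T] {
  def add(batchId: Long, metadata: T): Boolean
  def getLatestBatchId(): Option[Long]
}

// Cache the latest committed batch id in driver memory so each WAL commit
// avoids an exists/list round trip to the (possibly object-store) checkpoint.
class CachedMetadataLog[T](underlying: MetadataLog[T]) extends MetadataLog[T] {
  @volatile private var cachedLatest: Option[Long] = None

  override def add(batchId: Long, metadata: T): Boolean = {
    // Duplicate-commit check is served from memory instead of a FS lookup.
    if (cachedLatest.exists(_ >= batchId)) {
      false
    } else {
      val committed = underlying.add(batchId, metadata)
      if (committed) cachedLatest = Some(batchId)
      committed
    }
  }

  override def getLatestBatchId(): Option[Long] = cachedLatest.orElse {
    val fetched = underlying.getLatestBatchId() // one FS hit, then cached
    cachedLatest = fetched
    fetched
  }
}
{code}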






[jira] [Updated] (SPARK-33410) Resolve a SQL query referencing a column by an alias

2021-02-05 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33410:

Description: 
This PR adds support for resolving a column referenced by an alias in a SQL 
query, for example:
 ```sql
 select id + 1 as new_id, new_id + 1 as new_new_id from range(5);
 ```

Teradata and Snowflake support this feature: 
[https://docs.teradata.com/reader/e79ET77~NzPDz~Ykinj44w/MKSYuTyx2UJWXzdHJf3~sQ]

[https://www.mssqltips.com/sqlservertutorial/9292/snowflake-regular-expression-alias-and-ilike]

  was:
This PR adds support for resolving a column referenced by an alias in a SQL 
query, for example:
```sql
select id + 1 as new_id, new_id + 1 as new_new_id from range(5);
```

Teradata supports this feature: 
https://docs.teradata.com/reader/e79ET77~NzPDz~Ykinj44w/MKSYuTyx2UJWXzdHJf3~sQ


> Resolve a SQL query referencing a column by an alias
> 
>
> Key: SPARK-33410
> URL: https://issues.apache.org/jira/browse/SPARK-33410
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> This PR adds support for resolving a column referenced by an alias in a SQL 
> query, for example:
>  ```sql
>  select id + 1 as new_id, new_id + 1 as new_new_id from range(5);
>  ```
> Teradata and Snowflake support this feature: 
> [https://docs.teradata.com/reader/e79ET77~NzPDz~Ykinj44w/MKSYuTyx2UJWXzdHJf3~sQ]
> [https://www.mssqltips.com/sqlservertutorial/9292/snowflake-regular-expression-alias-and-ilike]
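
For context, a sketch of what the proposal would enable, assuming a spark-shell 
session with `spark` in scope (today the second projection fails to resolve 
new_id):

{code:scala}
// Under the proposal: new_id resolves to (id + 1), so new_new_id = id + 2.
// Today this raises an unresolved-column error for `new_id`.
spark.sql("select id + 1 as new_id, new_id + 1 as new_new_id from range(5)").show()
{code}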






[jira] [Commented] (SPARK-34382) ANSI SQL: LATERAL derived table (T491)

2021-02-05 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17280051#comment-17280051
 ] 

Xiao Li commented on SPARK-34382:
-

Reopened. This is a nice SQL feature we can support; it is also supported by 
other database systems.

> ANSI SQL: LATERAL derived table (T491)
> -
>
> Key: SPARK-34382
> URL: https://issues.apache.org/jira/browse/SPARK-34382
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
> This allows them to reference columns provided by preceding {{FROM}} items. 
> (Without {{LATERAL}}, each subquery is evaluated independently and so cannot 
> cross-reference any other {{FROM}} item.)
> Table functions appearing in {{FROM}} can also be preceded by the key word 
> {{LATERAL}}, but for functions the key word is optional; the function's 
> arguments can contain references to columns provided by preceding {{FROM}} 
> items in any case.
> A {{LATERAL}} item can appear at top level in the {{FROM}} list, or within a 
> {{JOIN}} tree. In the latter case it can also refer to any items that are on 
> the left-hand side of a {{JOIN}} that it is on the right-hand side of.
> When a {{FROM}} item contains {{LATERAL}} cross-references, evaluation 
> proceeds as follows: for each row of the {{FROM}} item providing the 
> cross-referenced column(s), or set of rows of multiple {{FROM}} items 
> providing the columns, the {{LATERAL}} item is evaluated using that row or 
> row set's values of the columns. The resulting row(s) are joined as usual 
> with the rows they were computed from. This is repeated for each row or set 
> of rows from the column source table(s).
> A trivial example of {{LATERAL}} is
> {code:sql}
> SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
> {code}
> *Feature ID*: T491
> [https://www.postgresql.org/docs/11/queries-table-expressions.html#QUERIES-FROM]
> [https://github.com/postgres/postgres/commit/5ebaaa49445eb1ba7b299bbea3a477d4e4c0430]






[jira] [Created] (SPARK-34382) ANSI SQL: LATERAL derived table (T491)

2021-02-05 Thread Xiao Li (Jira)
Xiao Li created SPARK-34382:
---

 Summary: ANSI SQL: LATERAL derived table (T491)
 Key: SPARK-34382
 URL: https://issues.apache.org/jira/browse/SPARK-34382
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
This allows them to reference columns provided by preceding {{FROM}} items. 
(Without {{LATERAL}}, each subquery is evaluated independently and so cannot 
cross-reference any other {{FROM}} item.)

Table functions appearing in {{FROM}} can also be preceded by the key word 
{{LATERAL}}, but for functions the key word is optional; the function's 
arguments can contain references to columns provided by preceding {{FROM}} 
items in any case.

A {{LATERAL}} item can appear at top level in the {{FROM}} list, or within a 
{{JOIN}} tree. In the latter case it can also refer to any items that are on 
the left-hand side of a {{JOIN}} that it is on the right-hand side of.

When a {{FROM}} item contains {{LATERAL}} cross-references, evaluation proceeds 
as follows: for each row of the {{FROM}} item providing the cross-referenced 
column(s), or set of rows of multiple {{FROM}} items providing the columns, the 
{{LATERAL}} item is evaluated using that row or row set's values of the 
columns. The resulting row(s) are joined as usual with the rows they were 
computed from. This is repeated for each row or set of rows from the column 
source table(s).

A trivial example of {{LATERAL}} is
{code:sql}
SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
{code}

*Feature ID*: T491

[https://www.postgresql.org/docs/11/queries-table-expressions.html#QUERIES-FROM]
[https://github.com/postgres/postgres/commit/5ebaaa49445eb1ba7b299bbea3a477d4e4c0430]
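
As a slightly less trivial illustration, assuming the feature lands with the 
PostgreSQL semantics described above (table and column names here are 
hypothetical), a LATERAL subquery enables per-row top-N lookups:

{code:scala}
// Top-1 bar row per foo row; bar.foo_id and bar.value are illustrative.
// Without LATERAL this needs a window function or a correlated rewrite.
spark.sql("""
  SELECT f.id, b.value
  FROM foo f,
  LATERAL (
    SELECT value FROM bar WHERE bar.foo_id = f.id ORDER BY value DESC LIMIT 1
  ) b
""").show()
{code}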






[jira] [Updated] (SPARK-27877) ANSI SQL: LATERAL derived table (T491)

2021-02-05 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-27877:

Issue Type: Technical task  (was: Sub-task)

> ANSI SQL: LATERAL derived table (T491)
> -
>
> Key: SPARK-27877
> URL: https://issues.apache.org/jira/browse/SPARK-27877
> Project: Spark
>  Issue Type: Technical task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
> This allows them to reference columns provided by preceding {{FROM}} items. 
> (Without {{LATERAL}}, each subquery is evaluated independently and so cannot 
> cross-reference any other {{FROM}} item.)
> Table functions appearing in {{FROM}} can also be preceded by the key word 
> {{LATERAL}}, but for functions the key word is optional; the function's 
> arguments can contain references to columns provided by preceding {{FROM}} 
> items in any case.
> A {{LATERAL}} item can appear at top level in the {{FROM}} list, or within a 
> {{JOIN}} tree. In the latter case it can also refer to any items that are on 
> the left-hand side of a {{JOIN}} that it is on the right-hand side of.
> When a {{FROM}} item contains {{LATERAL}} cross-references, evaluation 
> proceeds as follows: for each row of the {{FROM}} item providing the 
> cross-referenced column(s), or set of rows of multiple {{FROM}} items 
> providing the columns, the {{LATERAL}} item is evaluated using that row or 
> row set's values of the columns. The resulting row(s) are joined as usual 
> with the rows they were computed from. This is repeated for each row or set 
> of rows from the column source table(s).
> A trivial example of {{LATERAL}} is
> {code:sql}
> SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
> {code}
> *Feature ID*: T491
> [https://www.postgresql.org/docs/11/queries-table-expressions.html#QUERIES-FROM]
> [https://github.com/postgres/postgres/commit/5ebaaa49445eb1ba7b299bbea3a477d4e4c0430]






[jira] [Updated] (SPARK-27877) ANSI SQL: LATERAL derived table (T491)

2021-02-05 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-27877:

Issue Type: Sub-task  (was: Technical task)

> ANSI SQL: LATERAL derived table (T491)
> -
>
> Key: SPARK-27877
> URL: https://issues.apache.org/jira/browse/SPARK-27877
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
> This allows them to reference columns provided by preceding {{FROM}} items. 
> (Without {{LATERAL}}, each subquery is evaluated independently and so cannot 
> cross-reference any other {{FROM}} item.)
> Table functions appearing in {{FROM}} can also be preceded by the key word 
> {{LATERAL}}, but for functions the key word is optional; the function's 
> arguments can contain references to columns provided by preceding {{FROM}} 
> items in any case.
> A {{LATERAL}} item can appear at top level in the {{FROM}} list, or within a 
> {{JOIN}} tree. In the latter case it can also refer to any items that are on 
> the left-hand side of a {{JOIN}} that it is on the right-hand side of.
> When a {{FROM}} item contains {{LATERAL}} cross-references, evaluation 
> proceeds as follows: for each row of the {{FROM}} item providing the 
> cross-referenced column(s), or set of rows of multiple {{FROM}} items 
> providing the columns, the {{LATERAL}} item is evaluated using that row or 
> row set's values of the columns. The resulting row(s) are joined as usual 
> with the rows they were computed from. This is repeated for each row or set 
> of rows from the column source table(s).
> A trivial example of {{LATERAL}} is
> {code:sql}
> SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
> {code}
> *Feature ID*: T491
> [https://www.postgresql.org/docs/11/queries-table-expressions.html#QUERIES-FROM]
> [https://github.com/postgres/postgres/commit/5ebaaa49445eb1ba7b299bbea3a477d4e4c0430]






[jira] [Updated] (SPARK-34346) io.file.buffer.size set by spark.buffer.size is overridden by hive-site.xml, which may cause perf regression

2021-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34346:
--
Affects Version/s: 3.0.2

> io.file.buffer.size set by spark.buffer.size is overridden by hive-site.xml, 
> which may cause perf regression
> -
>
> Key: SPARK-34346
> URL: https://issues.apache.org/jira/browse/SPARK-34346
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Blocker
> Fix For: 3.0.2, 3.1.1
>
>
> In many real-world cases, when interacting with the Hive catalog through 
> Spark SQL, users may simply share the `hive-site.xml` used for their Hive 
> jobs and copy it to `SPARK_HOME`/conf without modification. In Spark, when we 
> generate Hadoop configurations, we use `spark.buffer.size` (65536) to reset 
> `io.file.buffer.size` (4096). But when we load the hive-site.xml, we may 
> ignore this behavior and reset `io.file.buffer.size` again according to 
> `hive-site.xml`.
> 1. The configuration priority for setting Hadoop and Hive configs here is not 
> right; the order should be `spark > spark.hive > spark.hadoop > hive > 
> hadoop`.
> 2. This breaks the `spark.buffer.size` config's behavior for tuning IO 
> performance with HDFS if there is an existing `io.file.buffer.size` in 
> hive-site.xml.
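
A minimal sketch of the intended precedence, assuming a simplified loading 
order (the real logic lives in Spark's Hadoop/Hive configuration plumbing and 
differs in detail); the point is that Spark-level settings must be applied 
after hive-site.xml so that they win:

{code:scala}
import org.apache.hadoop.conf.Configuration

// Last write wins: hadoop defaults < hive-site.xml < spark-backed keys.
def buildHadoopConf(sparkBufferSize: Int, hiveSiteProps: Map[String, String]): Configuration = {
  val conf = new Configuration() // hadoop default: io.file.buffer.size = 4096
  hiveSiteProps.foreach { case (k, v) => conf.set(k, v) } // values from hive-site.xml
  // Applied last, so spark.buffer.size overrides hive-site.xml's setting.
  conf.setInt("io.file.buffer.size", sparkBufferSize)
  conf
}
{code}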






[jira] [Resolved] (SPARK-34346) io.file.buffer.size set by spark.buffer.size is overridden by hive-site.xml, which may cause perf regression

2021-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34346.
---
Fix Version/s: 3.1.1
   3.0.2
 Assignee: Kent Yao
   Resolution: Fixed

> io.file.buffer.size set by spark.buffer.size is overridden by hive-site.xml, 
> which may cause perf regression
> -
>
> Key: SPARK-34346
> URL: https://issues.apache.org/jira/browse/SPARK-34346
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.1.1
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Blocker
> Fix For: 3.0.2, 3.1.1
>
>
> In many real-world cases, when interacting with the Hive catalog through 
> Spark SQL, users may simply share the `hive-site.xml` used for their Hive 
> jobs and copy it to `SPARK_HOME`/conf without modification. In Spark, when we 
> generate Hadoop configurations, we use `spark.buffer.size` (65536) to reset 
> `io.file.buffer.size` (4096). But when we load the hive-site.xml, we may 
> ignore this behavior and reset `io.file.buffer.size` again according to 
> `hive-site.xml`.
> 1. The configuration priority for setting Hadoop and Hive configs here is not 
> right; the order should be `spark > spark.hive > spark.hadoop > hive > 
> hadoop`.
> 2. This breaks the `spark.buffer.size` config's behavior for tuning IO 
> performance with HDFS if there is an existing `io.file.buffer.size` in 
> hive-site.xml.






[jira] [Commented] (SPARK-18591) Replace hash-based aggregates with sort-based ones if inputs already sorted

2021-02-05 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279995#comment-17279995
 ] 

Cheng Su commented on SPARK-18591:
--

Just came across this Jira. We added the same feature internally two years ago, 
and it is working well, serving ~300 queries per day. I feel that even if we do 
the logical-to-physical planning in a bottom-up way, there's still more work to 
be done, as `outputOrdering` is only valid after the `EnsureRequirements` rule. 
We added this as a physical plan rule run after `EnsureRequirements`.

Given we already have a bunch of physical plan rules after `EnsureRequirements` 
([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala#L354-L359]), 
shall we just promote them into a phase, e.g. called physical plan 
optimization, and add this rule as part of that phase? cc [~maropu] and 
[~cloud_fan], thanks.
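
A rough sketch of such a post-`EnsureRequirements` rule (class and constructor 
shapes are approximate, not the exact internal APIs): replace a hash aggregate 
with a sort aggregate when the child's output ordering already covers the 
grouping keys.

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Ascending, SortOrder}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.aggregate.{HashAggregateExec, SortAggregateExec}

object ReplaceHashWithSortAgg extends Rule[SparkPlan] {
  override def apply(plan: SparkPlan): SparkPlan = plan.transformUp {
    case agg: HashAggregateExec if SortOrder.orderingSatisfies(
        agg.child.outputOrdering,
        agg.groupingExpressions.map(SortOrder(_, Ascending))) =>
      // Same aggregate description, sort-based operator; no extra sort needed
      // because the child already delivers rows ordered by the grouping keys.
      SortAggregateExec(
        agg.requiredChildDistributionExpressions,
        agg.groupingExpressions,
        agg.aggregateExpressions,
        agg.aggregateAttributes,
        agg.initialInputBufferOffset,
        agg.resultExpressions,
        agg.child)
  }
}
{code}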

 

> Replace hash-based aggregates with sort-based ones if inputs already sorted
> ---
>
> Key: SPARK-18591
> URL: https://issues.apache.org/jira/browse/SPARK-18591
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Takeshi Yamamuro
>Priority: Major
>  Labels: bulk-closed
>
> Spark currently uses sort-based aggregates only in limited conditions: the 
> cases where Spark cannot use partial aggregates and hash-based ones.
> However, if the input ordering already satisfies the requirements of 
> sort-based aggregates, it seems sort-based ones are faster than hash-based 
> ones.
> {code}
> ./bin/spark-shell --conf spark.sql.shuffle.partitions=1
> val df = spark.range(1000).selectExpr("id AS key", "id % 10 AS value").sort($"key").cache
> def timer[R](block: => R): R = {
>   val t0 = System.nanoTime()
>   val result = block
>   val t1 = System.nanoTime()
>   // nanoseconds -> seconds
>   println("Elapsed time: " + ((t1 - t0) / 1e9) + "s")
>   result
> }
> timer {
>   df.groupBy("key").count().count
> }
> // codegen'd hash aggregate
> Elapsed time: 7.116962977s
> // non-codegen'd sort aggregate
> Elapsed time: 3.088816662s
> {code}
> If codegen'd sort-based aggregates are supported in SPARK-16844, this seems 
> to make the performance gap bigger:
> {code}
> - codegen'd sort aggregate
> Elapsed time: 1.645234684s
> {code}
> Therefore, it'd be better to use sort-based ones in this case.






[jira] [Commented] (SPARK-34380) Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES

2021-02-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279982#comment-17279982
 ] 

Apache Spark commented on SPARK-34380:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/31494

> Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES
> 
>
> Key: SPARK-34380
> URL: https://issues.apache.org/jira/browse/SPARK-34380
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Priority: Major
>
> Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES






[jira] [Commented] (SPARK-34380) Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES

2021-02-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279981#comment-17279981
 ] 

Apache Spark commented on SPARK-34380:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/31494

> Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES
> 
>
> Key: SPARK-34380
> URL: https://issues.apache.org/jira/browse/SPARK-34380
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Priority: Major
>
> Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES






[jira] [Assigned] (SPARK-34380) Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES

2021-02-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34380:


Assignee: Apache Spark

> Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES
> 
>
> Key: SPARK-34380
> URL: https://issues.apache.org/jira/browse/SPARK-34380
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Major
>
> Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES






[jira] [Assigned] (SPARK-34380) Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES

2021-02-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34380:


Assignee: (was: Apache Spark)

> Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES
> 
>
> Key: SPARK-34380
> URL: https://issues.apache.org/jira/browse/SPARK-34380
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Priority: Major
>
> Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES






[jira] [Updated] (SPARK-34381) c

2021-02-05 Thread Bruno Fortes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Fortes updated SPARK-34381:
-
Summary: c  (was: Encoding is not working if multiLine option is true. 
Spark 2.4.0)

> c
> -
>
> Key: SPARK-34381
> URL: https://issues.apache.org/jira/browse/SPARK-34381
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Bruno Fortes
>Priority: Major
> Fix For: 2.4.0
>
>







[jira] [Updated] (SPARK-34381) Encoding is not working if multiLine option is true. Spark 2.4.0

2021-02-05 Thread Bruno Fortes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Fortes updated SPARK-34381:
-
Attachment: (was: hive.PNG)

> Encoding is not working if multiLine option is true. Spark 2.4.0
> 
>
> Key: SPARK-34381
> URL: https://issues.apache.org/jira/browse/SPARK-34381
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Bruno Fortes
>Priority: Major
> Fix For: 2.4.0
>
>







[jira] [Updated] (SPARK-34381) Encoding is not working if multiLine option is true. Spark 2.4.0

2021-02-05 Thread Bruno Fortes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Fortes updated SPARK-34381:
-
Attachment: (was: spark.png)

> Encoding is not working if multiLine option is true. Spark 2.4.0
> 
>
> Key: SPARK-34381
> URL: https://issues.apache.org/jira/browse/SPARK-34381
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Bruno Fortes
>Priority: Major
> Fix For: 2.4.0
>
>







[jira] [Resolved] (SPARK-34381) Encoding is not working if multiLine option is true. Spark 2.4.0

2021-02-05 Thread Bruno Fortes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Fortes resolved SPARK-34381.
--
Resolution: Fixed

> Encoding is not working if multiLine option is true. Spark 2.4.0
> 
>
> Key: SPARK-34381
> URL: https://issues.apache.org/jira/browse/SPARK-34381
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Bruno Fortes
>Priority: Major
> Fix For: 2.4.0
>
>







[jira] [Updated] (SPARK-34381) Encoding is not working if multiLine option is true. Spark 2.4.0

2021-02-05 Thread Bruno Fortes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Fortes updated SPARK-34381:
-
Attachment: (was: csv.PNG)

> Encoding is not working if multiLine option is true. Spark 2.4.0
> 
>
> Key: SPARK-34381
> URL: https://issues.apache.org/jira/browse/SPARK-34381
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Bruno Fortes
>Priority: Major
> Fix For: 2.4.0
>
>







[jira] [Updated] (SPARK-34381) Encoding is not working if multiLine option is true. Spark 2.4.0

2021-02-05 Thread Bruno Fortes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Fortes updated SPARK-34381:
-
Description: (was: I have a CSV file with multiline records and ISO-8859-1 
encoding, but if I enable multiLine, the encoding is automatically set to the 
default (UTF-8), which breaks the characters, although the multiline parsing 
works. Can someone help me with this issue?

!spark.png!

!csv.PNG!

!hive.PNG!)

> Encoding is not working if multiLine option is true. Spark 2.4.0
> 
>
> Key: SPARK-34381
> URL: https://issues.apache.org/jira/browse/SPARK-34381
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Bruno Fortes
>Priority: Major
> Fix For: 2.4.0
>
>







[jira] [Commented] (SPARK-33207) Reduce the number of tasks launched after bucket pruning

2021-02-05 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279947#comment-17279947
 ] 

Cheng Su commented on SPARK-33207:
--

[~yumwang] - just an update: after [https://github.com/apache/spark/pull/31413] 
was merged, this issue should be resolved. I tested the query in the jira 
description and verified that only 1 task executed. cc [~cloud_fan] as well.

 

!Screen Shot 2021-02-05 at 11.44.12 AM.png!

> Reduce the number of tasks launched after bucket pruning
> 
>
> Key: SPARK-33207
> URL: https://issues.apache.org/jira/browse/SPARK-33207
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: Screen Shot 2021-02-05 at 11.44.12 AM.png, 
> image-2020-10-22-15-17-01-389.png, image-2020-10-22-15-17-26-956.png
>
>
> We only need to read 1 bucket, but it still launches 200 tasks.
> {code:sql}
> create table test_bucket using parquet clustered by (ID) sorted by (ID) into 
> 200 buckets AS (SELECT id FROM range(1000) cluster by id)
> spark-sql> explain select * from test_bucket where id = 4;
> == Physical Plan ==
> *(1) Project [id#7L]
> +- *(1) Filter (isnotnull(id#7L) AND (id#7L = 4))
>+- *(1) ColumnarToRow
>   +- FileScan parquet default.test_bucket[id#7L] Batched: true, 
> DataFilters: [isnotnull(id#7L), (id#7L = 4)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/test_bucket],
>  PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,4)], 
> ReadSchema: struct, SelectedBucketsCount: 1 out of 200
> {code}






[jira] [Updated] (SPARK-33207) Reduce the number of tasks launched after bucket pruning

2021-02-05 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-33207:
-
Attachment: Screen Shot 2021-02-05 at 11.44.12 AM.png

> Reduce the number of tasks launched after bucket pruning
> 
>
> Key: SPARK-33207
> URL: https://issues.apache.org/jira/browse/SPARK-33207
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: Screen Shot 2021-02-05 at 11.44.12 AM.png, 
> image-2020-10-22-15-17-01-389.png, image-2020-10-22-15-17-26-956.png
>
>
> We only need to read 1 bucket, but it still launches 200 tasks.
> {code:sql}
> create table test_bucket using parquet clustered by (ID) sorted by (ID) into 
> 200 buckets AS (SELECT id FROM range(1000) cluster by id)
> spark-sql> explain select * from test_bucket where id = 4;
> == Physical Plan ==
> *(1) Project [id#7L]
> +- *(1) Filter (isnotnull(id#7L) AND (id#7L = 4))
>+- *(1) ColumnarToRow
>   +- FileScan parquet default.test_bucket[id#7L] Batched: true, 
> DataFilters: [isnotnull(id#7L), (id#7L = 4)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/test_bucket],
>  PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,4)], 
> ReadSchema: struct, SelectedBucketsCount: 1 out of 200
> {code}






[jira] [Updated] (SPARK-34381) Encoding is not working if multiLine option is true. Spark 2.4.0

2021-02-05 Thread Bruno Fortes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Fortes updated SPARK-34381:
-
Description: 
I have a CSV file with multiline records and ISO-8859-1 encoding, but if I 
enable multiLine, the encoding is automatically set to the default (UTF-8), 
which breaks the characters, although the multiline parsing works. Can someone 
help me with this issue?

!spark.png!

!csv.PNG!

!hive.PNG!

  was:I have a CSV file with multiline records and ISO-8859-1 encoding, but if 
I enable multiLine, the encoding is automatically set to the default (UTF-8), 
which breaks the characters, although the multiline parsing works. Can someone 
help me with this issue?


> Encoding is not working if multiLine option is true. Spark 2.4.0
> 
>
> Key: SPARK-34381
> URL: https://issues.apache.org/jira/browse/SPARK-34381
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Bruno Fortes
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: csv.PNG, hive.PNG, spark.png
>
>
> I have a CSV file with multiline records and ISO-8859-1 encoding, but if I 
> enable multiLine, the encoding is automatically set to the default (UTF-8), 
> which breaks the characters, although the multiline parsing works. Can 
> someone help me with this issue?
> !spark.png!
> !csv.PNG!
> !hive.PNG!






[jira] [Updated] (SPARK-34381) Encoding is not working if multiLine option is true. Spark 2.4.0

2021-02-05 Thread Bruno Fortes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Fortes updated SPARK-34381:
-
Attachment: spark.png
csv.PNG
hive.PNG

> Encoding is not working if multiLine option is true. Spark 2.4.0
> 
>
> Key: SPARK-34381
> URL: https://issues.apache.org/jira/browse/SPARK-34381
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Bruno Fortes
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: csv.PNG, hive.PNG, spark.png
>
>
> I have a CSV file with multiline records and ISO-8859-1 encoding, but if I 
> enable multiLine, the encoding is automatically set to the default (UTF-8), 
> which breaks the characters, although the multiline parsing works. Can 
> someone help me with this issue?






[jira] [Created] (SPARK-34381) Encoding is not working if multiLine option is true. Spark 2.4.0

2021-02-05 Thread Bruno Fortes (Jira)
Bruno Fortes created SPARK-34381:


 Summary: Encoding is not working if multiLine option is true. 
Spark 2.4.0
 Key: SPARK-34381
 URL: https://issues.apache.org/jira/browse/SPARK-34381
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Bruno Fortes
 Fix For: 2.4.0
 Attachments: csv.PNG, hive.PNG, spark.png

I have a CSV file with multiline records and ISO-8859-1 encoding, but if I 
enable multiLine, the encoding is automatically set to the default (UTF-8), 
which breaks the characters, although the multiline parsing works. Can someone 
help me with this issue?






[jira] [Comment Edited] (SPARK-33501) Encoding is not working if multiLine option is true.

2021-02-05 Thread Bruno Fortes (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279928#comment-17279928
 ] 

Bruno Fortes edited comment on SPARK-33501 at 2/5/21, 7:19 PM:
---

Hey, did you guys find any solution for this issue? I'm having the same one.

[~hyukjin.kwon]

[~nileshpatil1992]


was (Author: brunofortes):
Hey, did you guys find any solution for this issue? I'm having the same one.

> Encoding is not working if multiLine option is true.
> 
>
> Key: SPARK-33501
> URL: https://issues.apache.org/jira/browse/SPARK-33501
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4, 2.4.3
>Reporter: Nilesh Patil
>Priority: Major
> Attachments: 1605860036183.csv, Screenshot from 2020-11-24 
> 10-27-17.png
>
>
> If we read with multiLine true and encoding "ISO-8859-1", then we get a 
> value like {color:#ff}AUTO EL*�*TRICA{color}, and if we read with multiLine 
> false and encoding "ISO-8859-1", then we get a value like 
> {color:#ff}AUTO EL*É*TRICA{color}.
> Below is the code we are using:
> {code}
> spark.read().option("header", "true")
>   .option("inferSchema", true)
>   .option("delimiter", ";")
>   .option("quote", "\"")
>   .option("multiLine", true)
>   .option("encoding", "ISO-8859-1")
>   .csv("1605860036183.csv")
>   .show()
> {code}
> The sample file is attached.
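
One possible workaround, as a sketch only (not an official fix), is to 
transcode the file to UTF-8 first so the multiLine path's default charset 
matches the data; the path below is the attachment name from this issue, and a 
spark-shell session is assumed:

{code:scala}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Re-encode the ISO-8859-1 bytes to UTF-8 before handing the file to the reader.
val raw = Files.readAllBytes(Paths.get("1605860036183.csv"))
val utf8 = new String(raw, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.UTF_8)
Files.write(Paths.get("1605860036183-utf8.csv"), utf8)

spark.read
  .option("header", "true")
  .option("delimiter", ";")
  .option("quote", "\"")
  .option("multiLine", true)
  .csv("1605860036183-utf8.csv")
  .show()
{code}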






[jira] [Commented] (SPARK-33501) Encoding is not working if multiLine option is true.

2021-02-05 Thread Bruno Fortes (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279928#comment-17279928
 ] 

Bruno Fortes commented on SPARK-33501:
--

Hey, did you guys find any solution for this issue? I'm having the same one.

> Encoding is not working if multiLine option is true.
> 
>
> Key: SPARK-33501
> URL: https://issues.apache.org/jira/browse/SPARK-33501
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4, 2.4.3
>Reporter: Nilesh Patil
>Priority: Major
> Attachments: 1605860036183.csv, Screenshot from 2020-11-24 
> 10-27-17.png
>
>
> If we read with multiLine true and encoding "ISO-8859-1", then we get a 
> value like {color:#ff}AUTO EL*�*TRICA{color}, and if we read with multiLine 
> false and encoding "ISO-8859-1", then we get a value like 
> {color:#ff}AUTO EL*É*TRICA{color}.
> Below is the code we are using:
> {code}
> spark.read().option("header", "true")
>   .option("inferSchema", true)
>   .option("delimiter", ";")
>   .option("quote", "\"")
>   .option("multiLine", true)
>   .option("encoding", "ISO-8859-1")
>   .csv("1605860036183.csv")
>   .show()
> {code}
> The sample file is attached.






[jira] [Assigned] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves

2021-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-26836:
-

Assignee: Attila Zsolt Piros

> Columns get switched in Spark SQL using Avro backed Hive table if schema 
> evolves
> 
>
> Key: SPARK-26836
> URL: https://issues.apache.org/jira/browse/SPARK-26836
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0
> Environment: I tested with Hive and HCatalog which runs on version 
> 2.3.4 and with Spark 2.3.1 and 2.4
>Reporter: Tamás Németh
>Assignee: Attila Zsolt Piros
>Priority: Critical
>  Labels: correctness
> Attachments: doctors.avro, doctors_evolved.avro, 
> doctors_evolved.json, original.avsc
>
>
> I have a hive avro table where the avro schema is stored on s3 next to the 
> avro files. 
> In the table definition the avro.schema.url always points to the latest 
> partition's _schema.avsc file, which is always the latest schema. (Avro 
> schemas are backward and forward compatible in a table.)
> When new data comes in, I always add a new partition whose avro.schema.url 
> property is also set to the _schema.avsc that was used when it was added, and 
> of course I always update the table's avro.schema.url property to the latest 
> one.
> Querying this table works fine until the schema evolves in a way that a new 
> optional property is added in the middle. 
> When this happens, after the Spark SQL query the columns in the old partition 
> get mixed up and show the wrong data.
> If I query the table with Hive then everything is perfectly fine, and it 
> gives me back the correct columns both for the partitions created with the 
> old schema and for the new ones created with the evolved schema.
>  
> Here is how I could reproduce it with the 
> [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro]
>  example data in the sql test suite.
>  # I have created two partition folders:
> {code:java}
> [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors
> /dt=2019-02-05/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-05/doctors.avro
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors
> /dt=2019-02-06/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-06/doctors_evolved.avro{code}
> Here the first partition had data created with the schema before evolving, 
> and the second one had the evolved schema. (The evolved schema is the same as 
> in your test case, except I moved the extra_field column from second place to 
> last, and I generated two lines of avro data with the evolved schema.)
>  # I have created a hive table with the following command:
>  
> {code:java}
> CREATE EXTERNAL TABLE `default.doctors`
>  PARTITIONED BY (
>  `dt` string
>  )
>  ROW FORMAT SERDE
>  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>  WITH SERDEPROPERTIES (
>  'avro.schema.url'='s3://somelocation/doctors/
> /dt=2019-02-06/_schema.avsc')
>  STORED AS INPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>  OUTPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>  LOCATION
>  's3://somelocation/doctors/'
>  TBLPROPERTIES (
>  'transient_lastDdlTime'='1538130975'){code}
>  
> Here, as you can see, the table schema url points to the latest schema.
> 3. I ran an msck _repair table_ to pick up all the partitions.
> Fyi: If I run my select * query at this point then everything is fine and no 
> column switch happens.
> 4. Then I changed the first partition's avro.schema.url to point to the 
> schema under the partition folder (the non-evolved one -> 
> s3://somelocation/doctors/
> /dt=2019-02-05/_schema.avsc)
> Then if you run a _select * from default.spark_test_, the columns get 
> mixed up (in the data below, the first_name column becomes the extra_field 
> column; I guess because in the latest schema it is the second column):
>  
> {code:java}
> number,extra_field,first_name,last_name,dt 
> 6,Colin,Baker,null,2019-02-05 
> 3,Jon,Pertwee,null,2019-02-05 
> 4,Tom,Baker,null,2019-02-05 
> 5,Peter,Davison,null,2019-02-05 
> 11,Matt,Smith,null,2019-02-05 
> 1,William,Hartnell,null,2019-02-05 
> 7,Sylvester,McCoy,null,2019-02-05 
> 8,Paul,McGann,null,2019-02-05 
> 2,Patrick,Troughton,null,2019-02-05 
> 9,Christopher,Eccleston,null,2019-02-05 
> 10,David,Tennant,null,2019-02-05 
> 21,fishfinger,Jim,Baker,2019-02-06 
> 

[jira] [Resolved] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves

2021-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26836.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31133
[https://github.com/apache/spark/pull/31133]

> Columns get switched in Spark SQL using Avro backed Hive table if schema 
> evolves
> 
>
> Key: SPARK-26836
> URL: https://issues.apache.org/jira/browse/SPARK-26836
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0
> Environment: I tested with Hive and HCatalog which runs on version 
> 2.3.4 and with Spark 2.3.1 and 2.4
>Reporter: Tamás Németh
>Assignee: Attila Zsolt Piros
>Priority: Critical
>  Labels: correctness
> Fix For: 3.2.0
>
> Attachments: doctors.avro, doctors_evolved.avro, 
> doctors_evolved.json, original.avsc
>
>
> I have a hive avro table where the avro schema is stored on s3 next to the 
> avro files. 
> In the table definition the avro.schema.url always points to the latest 
> partition's _schema.avsc file, which is always the latest schema. (Avro 
> schemas are backward and forward compatible in a table.)
> When new data comes in, I always add a new partition whose avro.schema.url 
> property is also set to the _schema.avsc that was used when it was added, and 
> of course I always update the table's avro.schema.url property to the latest 
> one.
> Querying this table works fine until the schema evolves in a way that a new 
> optional property is added in the middle. 
> When this happens, after the Spark SQL query the columns in the old partition 
> get mixed up and show the wrong data.
> If I query the table with Hive then everything is perfectly fine, and it 
> gives me back the correct columns both for the partitions created with the 
> old schema and for the new ones created with the evolved schema.
>  
> Here is how I could reproduce it with the 
> [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro]
>  example data in the sql test suite.
>  # I have created two partition folders:
> {code:java}
> [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors
> /dt=2019-02-05/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-05/doctors.avro
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors
> /dt=2019-02-06/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-06/doctors_evolved.avro{code}
> Here the first partition had data created with the schema before evolving, 
> and the second one had the evolved schema. (The evolved schema is the same as 
> in your test case, except I moved the extra_field column from second place to 
> last, and I generated two lines of avro data with the evolved schema.)
>  # I have created a hive table with the following command:
>  
> {code:java}
> CREATE EXTERNAL TABLE `default.doctors`
>  PARTITIONED BY (
>  `dt` string
>  )
>  ROW FORMAT SERDE
>  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>  WITH SERDEPROPERTIES (
>  'avro.schema.url'='s3://somelocation/doctors/
> /dt=2019-02-06/_schema.avsc')
>  STORED AS INPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>  OUTPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>  LOCATION
>  's3://somelocation/doctors/'
>  TBLPROPERTIES (
>  'transient_lastDdlTime'='1538130975'){code}
>  
> Here, as you can see, the table schema url points to the latest schema.
> 3. I ran an msck _repair table_ to pick up all the partitions.
> Fyi: If I run my select * query at this point then everything is fine and no 
> column switch happens.
> 4. Then I changed the first partition's avro.schema.url to point to the 
> schema under the partition folder (the non-evolved one -> 
> s3://somelocation/doctors/
> /dt=2019-02-05/_schema.avsc)
> Then if you run a _select * from default.spark_test_, the columns get 
> mixed up (in the data below, the first_name column becomes the extra_field 
> column; I guess because in the latest schema it is the second column):
>  
> {code:java}
> number,extra_field,first_name,last_name,dt 
> 6,Colin,Baker,null,2019-02-05 
> 3,Jon,Pertwee,null,2019-02-05 
> 4,Tom,Baker,null,2019-02-05 
> 5,Peter,Davison,null,2019-02-05 
> 11,Matt,Smith,null,2019-02-05 
> 1,William,Hartnell,null,2019-02-05 
> 7,Sylvester,McCoy,null,2019-02-05 
> 8,Paul,McGann,null,2019-02-05 
> 2,Patrick,Troughton,null,2019-02-05 
> 

[jira] [Commented] (SPARK-34363) Allow users to configure a maximum amount of remote shuffle block storage

2021-02-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279899#comment-17279899
 ] 

Apache Spark commented on SPARK-34363:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31493

> Allow users to configure a maximum amount of remote shuffle block storage
> -
>
> Key: SPARK-34363
> URL: https://issues.apache.org/jira/browse/SPARK-34363
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Priority: Major
>







[jira] [Commented] (SPARK-34363) Allow users to configure a maximum amount of remote shuffle block storage

2021-02-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279898#comment-17279898
 ] 

Apache Spark commented on SPARK-34363:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31493

> Allow users to configure a maximum amount of remote shuffle block storage
> -
>
> Key: SPARK-34363
> URL: https://issues.apache.org/jira/browse/SPARK-34363
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Priority: Major
>







[jira] [Assigned] (SPARK-34363) Allow users to configure a maximum amount of remote shuffle block storage

2021-02-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34363:


Assignee: (was: Apache Spark)

> Allow users to configure a maximum amount of remote shuffle block storage
> -
>
> Key: SPARK-34363
> URL: https://issues.apache.org/jira/browse/SPARK-34363
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Priority: Major
>







[jira] [Assigned] (SPARK-34363) Allow users to configure a maximum amount of remote shuffle block storage

2021-02-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34363:


Assignee: Apache Spark

> Allow users to configure a maximum amount of remote shuffle block storage
> -
>
> Key: SPARK-34363
> URL: https://issues.apache.org/jira/browse/SPARK-34363
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-34346) io.file.buffer.size set by spark.buffer.size is overridden by hive-site.xml, which may cause perf regression

2021-02-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279896#comment-17279896
 ] 

Apache Spark commented on SPARK-34346:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/31492

> io.file.buffer.size set by spark.buffer.size is overridden by hive-site.xml, 
> which may cause perf regression
> -
>
> Key: SPARK-34346
> URL: https://issues.apache.org/jira/browse/SPARK-34346
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.1.1
>Reporter: Kent Yao
>Priority: Blocker
>
> In many real-world cases, when interacting with the Hive catalog through 
> Spark SQL, users may simply share the `hive-site.xml` used for their Hive 
> jobs and copy it to `SPARK_HOME`/conf without modification. In Spark, when we 
> generate Hadoop configurations, we use `spark.buffer.size` (65536) to reset 
> `io.file.buffer.size` (4096). But when we load the hive-site.xml, we may 
> ignore this behavior and reset `io.file.buffer.size` again according to 
> `hive-site.xml`.
> 1. The configuration priority for setting Hadoop and Hive configs here is not 
> right; the order should be `spark > spark.hive > spark.hadoop > hive > 
> hadoop`.
> 2. This breaks the `spark.buffer.size` config's behavior for tuning IO 
> performance with HDFS if there is an existing `io.file.buffer.size` in 
> hive-site.xml.






[jira] [Commented] (SPARK-34380) Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES

2021-02-05 Thread Terry Kim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279893#comment-17279893
 ] 

Terry Kim commented on SPARK-34380:
---

Looking

> Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES
> 
>
> Key: SPARK-34380
> URL: https://issues.apache.org/jira/browse/SPARK-34380
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Priority: Major
>
> Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES






[jira] [Commented] (SPARK-34346) io.file.buffer.size set by spark.buffer.size is overridden by hive-site.xml, which may cause perf regression

2021-02-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279894#comment-17279894
 ] 

Apache Spark commented on SPARK-34346:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/31492

> io.file.buffer.size set by spark.buffer.size will override by hive-site.xml 
> may cause perf regression
> -
>
> Key: SPARK-34346
> URL: https://issues.apache.org/jira/browse/SPARK-34346
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.1.1
>Reporter: Kent Yao
>Priority: Blocker
>
> In many real-world cases, when interacting with the Hive catalog through Spark 
> SQL, users may simply share the `hive-site.xml` used for their Hive jobs and 
> copy it to `SPARK_HOME`/conf without modification. In Spark, when we generate 
> Hadoop configurations, we use `spark.buffer.size` (65536) to reset 
> `io.file.buffer.size` (4096). But when we load the hive-site.xml, we may 
> ignore this behavior and reset `io.file.buffer.size` again according to 
> `hive-site.xml`.
> 1. The configuration priority for setting Hadoop and Hive configs here is not 
> right; literally, the order should be `spark > spark.hive > 
> spark.hadoop > hive > hadoop`
> 2. This breaks the `spark.buffer.size` config's behavior for tuning the IO 
> performance w/ HDFS if there is an existing `io.file.buffer.size` in 
> hive-site.xml 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34380) Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES

2021-02-05 Thread Terry Kim (Jira)
Terry Kim created SPARK-34380:
-

 Summary: Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES
 Key: SPARK-34380
 URL: https://issues.apache.org/jira/browse/SPARK-34380
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Terry Kim


Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES
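For context, a minimal sketch of the syntax in question (assuming a SparkSession
{{spark}} and that IF EXISTS should suppress the error when a listed property is
not actually set on the table):

{code:scala}
spark.sql("CREATE TABLE demo (id INT) USING parquet TBLPROPERTIES ('owner' = 'a')")

// Without IF EXISTS, unsetting a key that was never set is expected to fail:
// spark.sql("ALTER TABLE demo UNSET TBLPROPERTIES ('no_such_key')")

// With IF EXISTS, missing keys are silently ignored:
spark.sql("ALTER TABLE demo UNSET TBLPROPERTIES IF EXISTS ('owner', 'no_such_key')")
{code}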



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34379) Map JDBC RowID to StringType rather than LongType

2021-02-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279892#comment-17279892
 ] 

Apache Spark commented on SPARK-34379:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/31491

> Map JDBC RowID to StringType rather than LongType
> -
>
> Key: SPARK-34379
> URL: https://issues.apache.org/jira/browse/SPARK-34379
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current implementation, the JDBC RowID type is mapped to LongType except 
> for OracleDialect, but there is no guarantee that a RowID can be converted to a 
> long.
> java.sql.RowId declares toString and the specification of java.sql.RowId says
> _all methods on the RowId interface must be fully implemented if the JDBC 
> driver supports the data type_
> (https://docs.oracle.com/javase/8/docs/api/java/sql/RowId.html)
> So, we should prefer StringType to LongType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34379) Map JDBC RowID to StringType rather than LongType

2021-02-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34379:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Map JDBC RowID to StringType rather than LongType
> -
>
> Key: SPARK-34379
> URL: https://issues.apache.org/jira/browse/SPARK-34379
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current implementation, the JDBC RowID type is mapped to LongType except 
> for OracleDialect, but there is no guarantee that a RowID can be converted to a 
> long.
> java.sql.RowId declares toString and the specification of java.sql.RowId says
> _all methods on the RowId interface must be fully implemented if the JDBC 
> driver supports the data type_
> (https://docs.oracle.com/javase/8/docs/api/java/sql/RowId.html)
> So, we should prefer StringType to LongType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34379) Map JDBC RowID to StringType rather than LongType

2021-02-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34379:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Map JDBC RowID to StringType rather than LongType
> -
>
> Key: SPARK-34379
> URL: https://issues.apache.org/jira/browse/SPARK-34379
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> In the current implementation, the JDBC RowID type is mapped to LongType except 
> for OracleDialect, but there is no guarantee that a RowID can be converted to a 
> long.
> java.sql.RowId declares toString and the specification of java.sql.RowId says
> _all methods on the RowId interface must be fully implemented if the JDBC 
> driver supports the data type_
> (https://docs.oracle.com/javase/8/docs/api/java/sql/RowId.html)
> So, we should prefer StringType to LongType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34379) Map JDBC RowID to StringType rather than LongType

2021-02-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279891#comment-17279891
 ] 

Apache Spark commented on SPARK-34379:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/31491

> Map JDBC RowID to StringType rather than LongType
> -
>
> Key: SPARK-34379
> URL: https://issues.apache.org/jira/browse/SPARK-34379
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current implementation, the JDBC RowID type is mapped to LongType except 
> for OracleDialect, but there is no guarantee that a RowID can be converted to a 
> long.
> java.sql.RowId declares toString and the specification of java.sql.RowId says
> _all methods on the RowId interface must be fully implemented if the JDBC 
> driver supports the data type_
> (https://docs.oracle.com/javase/8/docs/api/java/sql/RowId.html)
> So, we should prefer StringType to LongType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34379) Map JDBC RowID to StringType rather than LongType

2021-02-05 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-34379:
---
Description: 
In the current implementation, the JDBC RowID type is mapped to LongType except for 
OracleDialect, but there is no guarantee that a RowID can be converted to a long.
java.sql.RowId declares toString and the specification of java.sql.RowId says

_all methods on the RowId interface must be fully implemented if the JDBC 
driver supports the data type_
(https://docs.oracle.com/javase/8/docs/api/java/sql/RowId.html)

So, we should prefer StringType to LongType.

  was:
In the current implementation, JDBC RowID type is mapped to LongType except for 
OracleDialect, but there is no guarantee to be able to convert RowID to long.
The specification of java.sql.RowId says

"all methods on the RowId interface must be fully implemented if the JDBC 
driver supports the data type"
https://docs.oracle.com/javase/8/docs/api/java/sql/RowId.html


java.sql.RowId declares toString so, we should prefer StringType to LongType.


> Map JDBC RowID to StringType rather than LongType
> -
>
> Key: SPARK-34379
> URL: https://issues.apache.org/jira/browse/SPARK-34379
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current implementation, the JDBC RowID type is mapped to LongType except 
> for OracleDialect, but there is no guarantee that a RowID can be converted to a 
> long.
> java.sql.RowId declares toString and the specification of java.sql.RowId says
> _all methods on the RowId interface must be fully implemented if the JDBC 
> driver supports the data type_
> (https://docs.oracle.com/javase/8/docs/api/java/sql/RowId.html)
> So, we should prefer StringType to LongType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34379) Map JDBC RowID to StringType rather than LongType

2021-02-05 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-34379:
--

 Summary: Map JDBC RowID to StringType rather than LongType
 Key: SPARK-34379
 URL: https://issues.apache.org/jira/browse/SPARK-34379
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


In the current implementation, the JDBC RowID type is mapped to LongType except for 
OracleDialect, but there is no guarantee that a RowID can be converted to a long.
The specification of java.sql.RowId says

"all methods on the RowId interface must be fully implemented if the JDBC 
driver supports the data type"
https://docs.oracle.com/javase/8/docs/api/java/sql/RowId.html


java.sql.RowId declares toString, so we should prefer StringType to LongType.
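A minimal sketch of the kind of mapping involved, using the public JdbcDialect
extension point (the dialect and URL prefix here are hypothetical, not the
actual patch):

{code:scala}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

object RowIdAsStringDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mydb:")

  // Map java.sql.Types.ROWID to StringType: RowId only guarantees toString.
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    if (sqlType == Types.ROWID) Some(StringType) else None
}

JdbcDialects.registerDialect(RowIdAsStringDialect)
{code}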



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34379) Map JDBC RowID to StringType rather than LongType

2021-02-05 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-34379:
---
Issue Type: Bug  (was: Improvement)

> Map JDBC RowID to StringType rather than LongType
> -
>
> Key: SPARK-34379
> URL: https://issues.apache.org/jira/browse/SPARK-34379
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current implementation, the JDBC RowID type is mapped to LongType except 
> for OracleDialect, but there is no guarantee that a RowID can be converted to a 
> long.
> The specification of java.sql.RowId says
> "all methods on the RowId interface must be fully implemented if the JDBC 
> driver supports the data type"
> https://docs.oracle.com/javase/8/docs/api/java/sql/RowId.html
> java.sql.RowId declares toString, so we should prefer StringType to LongType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34365) Support configurable Avro schema field matching for positional or by-name

2021-02-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34365:


Assignee: (was: Apache Spark)

> Support configurable Avro schema field matching for positional or by-name
> -
>
> Key: SPARK-34365
> URL: https://issues.apache.org/jira/browse/SPARK-34365
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Priority: Major
>
> When reading an Avro dataset (using the dataset's schema or by overriding it 
> with 'avroSchema') or writing an Avro dataset with a provided schema by 
> 'avroSchema', currently the matching of Catalyst-to-Avro fields is done by 
> field name.
> This behavior is somewhat recent; prior to SPARK-27762 (fixed in 3.0.0), at 
> least on the write path, we would match the schemas positionally 
> ("structural" comparison). While I agree that this is much more sensible for 
> default behavior, I propose that we make this behavior configurable using an 
> {{option}} for the Avro datasource. Even at the time that SPARK-27762 was 
> handled, there was [interest in making this behavior 
> configurable|https://github.com/apache/spark/pull/24635#issuecomment-494205251],
>  but it appears it went unaddressed.
> There is precedent for configurability of this behavior, as seen in 
> SPARK-32864, which added this support for ORC. Besides this precedent, the 
> behavior of Hive is to perform matching positionally 
> ([ref|https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-WritingtablestoAvrofiles]),
>  so this is behavior that Hadoop/Hive ecosystem users are familiar with:
> {quote}
> Hive is very forgiving about types: it will attempt to store whatever value 
> matches the provided column in the equivalent column position in the new 
> table. No matching is done on column names, for instance.
> {quote}
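A hedged sketch of what such an option could look like for the Avro datasource
(assuming a SparkSession {{spark}}; the option name {{positionalFieldMatching}}
is an assumption for illustration, since the ticket only proposes that some
option exist):

{code:scala}
val avroSchemaJson =
  """{"type":"record","name":"Event","fields":[
    |  {"name":"id","type":"long"},
    |  {"name":"word","type":"string"}
    |]}""".stripMargin

val df = spark.read
  .format("avro")
  .option("avroSchema", avroSchemaJson)      // user-provided schema
  .option("positionalFieldMatching", "true") // hypothetical option name
  .load("/tmp/events.avro")
{code}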



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34365) Support configurable Avro schema field matching for positional or by-name

2021-02-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279862#comment-17279862
 ] 

Apache Spark commented on SPARK-34365:
--

User 'xkrogen' has created a pull request for this issue:
https://github.com/apache/spark/pull/31490

> Support configurable Avro schema field matching for positional or by-name
> -
>
> Key: SPARK-34365
> URL: https://issues.apache.org/jira/browse/SPARK-34365
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Priority: Major
>
> When reading an Avro dataset (using the dataset's schema or by overriding it 
> with 'avroSchema') or writing an Avro dataset with a provided schema by 
> 'avroSchema', currently the matching of Catalyst-to-Avro fields is done by 
> field name.
> This behavior is somewhat recent; prior to SPARK-27762 (fixed in 3.0.0), at 
> least on the write path, we would match the schemas positionally 
> ("structural" comparison). While I agree that this is much more sensible for 
> default behavior, I propose that we make this behavior configurable using an 
> {{option}} for the Avro datasource. Even at the time that SPARK-27762 was 
> handled, there was [interest in making this behavior 
> configurable|https://github.com/apache/spark/pull/24635#issuecomment-494205251],
>  but it appears it went unaddressed.
> There is precedent for configurability of this behavior, as seen in 
> SPARK-32864, which added this support for ORC. Besides this precedent, the 
> behavior of Hive is to perform matching positionally 
> ([ref|https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-WritingtablestoAvrofiles]),
>  so this is behavior that Hadoop/Hive ecosystem users are familiar with:
> {quote}
> Hive is very forgiving about types: it will attempt to store whatever value 
> matches the provided column in the equivalent column position in the new 
> table. No matching is done on column names, for instance.
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34365) Support configurable Avro schema field matching for positional or by-name

2021-02-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34365:


Assignee: Apache Spark

> Support configurable Avro schema field matching for positional or by-name
> -
>
> Key: SPARK-34365
> URL: https://issues.apache.org/jira/browse/SPARK-34365
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Assignee: Apache Spark
>Priority: Major
>
> When reading an Avro dataset (using the dataset's schema or by overriding it 
> with 'avroSchema') or writing an Avro dataset with a provided schema by 
> 'avroSchema', currently the matching of Catalyst-to-Avro fields is done by 
> field name.
> This behavior is somewhat recent; prior to SPARK-27762 (fixed in 3.0.0), at 
> least on the write path, we would match the schemas positionally 
> ("structural" comparison). While I agree that this is much more sensible for 
> default behavior, I propose that we make this behavior configurable using an 
> {{option}} for the Avro datasource. Even at the time that SPARK-27762 was 
> handled, there was [interest in making this behavior 
> configurable|https://github.com/apache/spark/pull/24635#issuecomment-494205251],
>  but it appears it went unaddressed.
> There is precedent for configurability of this behavior, as seen in 
> SPARK-32864, which added this support for ORC. Besides this precedent, the 
> behavior of Hive is to perform matching positionally 
> ([ref|https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-WritingtablestoAvrofiles]),
>  so this is behavior that Hadoop/Hive ecosystem users are familiar with:
> {quote}
> Hive is very forgiving about types: it will attempt to store whatever value 
> matches the provided column in the equivalent column position in the new 
> table. No matching is done on column names, for instance.
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34134) LDAP authentication of spark thrift server support user id mapping

2021-02-05 Thread Timothy Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279847#comment-17279847
 ] 

Timothy Zhang commented on SPARK-34134:
---

Almost all applications I have used support it, such as Cognos, Jenkins, Graylog, 
etc. 

> LDAP authentication of spark thrift server support user id mapping
> --
>
> Key: SPARK-34134
> URL: https://issues.apache.org/jira/browse/SPARK-34134
> Project: Spark
>  Issue Type: Improvement
>  Components: Security
>Affects Versions: 3.0.1
>Reporter: Timothy Zhang
>Priority: Major
>
> I'm trying to configure LDAP authentication for the Spark thrift server, and 
> would like to implement mapping of user ids to mail addresses.
> My scenario is: "uid" is the key of our LDAP system, and "mail" (email 
> address) is one of its attributes. We want users to input their email address, 
> i.e. "mail", when they log in with a thrift client. That is, we map the 
> "username" input to a mail-attribute query, e.g.:
> {code:none}
> hive.server2.authentication.ldap.customLDAPQuery="(&(objectClass=person)(mail=${uid}))"
> {code}
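For context, the custom query above sits alongside the standard HiveServer2
LDAP settings that the Spark thrift server picks up from hive-site.xml (host
and DN values below are placeholders; shown as key=value pairs for brevity
rather than XML <property> entries):

{code:none}
hive.server2.authentication=LDAP
hive.server2.authentication.ldap.url=ldap://ldap.example.com:389
hive.server2.authentication.ldap.baseDN=ou=people,dc=example,dc=com
hive.server2.authentication.ldap.customLDAPQuery=(&(objectClass=person)(mail=${uid}))
{code}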



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34378) Support extra optional Avro fields in AvroSerializer

2021-02-05 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279839#comment-17279839
 ] 

Erik Krogen commented on SPARK-34378:
-

Internally we build this feature on top of SPARK-34365, so I will wait until 
that JIRA is finalized before posting a PR here.

> Support extra optional Avro fields in AvroSerializer
> 
>
> Key: SPARK-34378
> URL: https://issues.apache.org/jira/browse/SPARK-34378
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Priority: Major
>
> Currently, when writing out Avro data using a custom schema ({{avroSchema}}), 
> if there are any extra Avro fields which do not have a matching field in the 
> Catalyst schema, the serialization will fail. This is much more strict than 
> on the deserialization path, where Avro fields not present in the Catalyst 
> schema are ignored, and Catalyst fields not present in the Avro schema are 
> allowed as long as they are nullable. I believe it will be more user-friendly 
> if extra Avro fields are allowed, as long as they are optional. This makes it 
> easier for users to write out data with Avro schemas which may be outside of 
> their control.
> If there is concern about the safety of this approach (i.e. there are use 
> cases where users want strict matching), we can make it configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34378) Support extra optional Avro fields in AvroSerializer

2021-02-05 Thread Erik Krogen (Jira)
Erik Krogen created SPARK-34378:
---

 Summary: Support extra optional Avro fields in AvroSerializer
 Key: SPARK-34378
 URL: https://issues.apache.org/jira/browse/SPARK-34378
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.1
Reporter: Erik Krogen


Currently, when writing out Avro data using a custom schema ({{avroSchema}}), 
if there are any extra Avro fields which do not have a matching field in the 
Catalyst schema, the serialization will fail. This is much more strict than on 
the deserialization path, where Avro fields not present in the Catalyst schema 
are ignored, and Catalyst fields not present in the Avro schema are allowed as 
long as they are nullable. I believe it will be more user-friendly if extra 
Avro fields are allowed, as long as they are optional. This makes it easier for 
users to write out data with Avro schemas which may be outside of their control.

If there is concern about the safety of this approach (i.e. there are use cases 
where users want strict matching), we can make it configurable.
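A minimal repro sketch of the scenario (assuming a local SparkSession and the
spark-avro module on the classpath): the Avro schema carries an extra optional
field {{tag}} with no Catalyst counterpart, which currently fails in
AvroSerializer and which this proposal would allow.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Extra optional field `tag` (nullable union with a null default):
val avroSchemaJson =
  """{"type":"record","name":"Rec","fields":[
    |  {"name":"id","type":"long"},
    |  {"name":"tag","type":["null","string"],"default":null}
    |]}""".stripMargin

Seq(1L, 2L).toDF("id")
  .write.format("avro")
  .option("avroSchema", avroSchemaJson) // fails today; proposal: allow it
  .save("/tmp/rec-avro")
{code}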



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34298) SaveMode.Overwrite not usable when using s3a root paths

2021-02-05 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279798#comment-17279798
 ] 

Steve Loughran commented on SPARK-34298:


well, I'm sure a PR with tests will get reviewed...

> SaveMode.Overwrite not usable when using s3a root paths 
> 
>
> Key: SPARK-34298
> URL: https://issues.apache.org/jira/browse/SPARK-34298
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: cornel creanga
>Priority: Minor
>
> SaveMode.Overwrite does not work when using paths containing just the root, e.g. 
> "s3a://peakhour-report". To reproduce the issue (an s3 bucket + credentials 
> are needed):
> {code:scala}
> val out = "s3a://peakhour-report"
> val sparkContext: SparkContext = SparkContext.getOrCreate()
> val someData = Seq(Row(24, "mouse"))
> val someSchema = List(StructField("age", IntegerType, true),
>   StructField("word", StringType, true))
> val someDF = spark.createDataFrame(
>   spark.sparkContext.parallelize(someData), StructType(someSchema))
> sparkContext.hadoopConfiguration.set("fs.s3a.access.key", accessKey)
> sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)
> sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider",
>   "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
> sparkContext.hadoopConfiguration.set("fs.s3a.impl",
>   "org.apache.hadoop.fs.s3a.S3AFileSystem")
> someDF.write.format("parquet").partitionBy("age").mode(SaveMode.Overwrite)
>   .save(out)
> {code}
> Error stacktrace:
> {code:none}
> Exception in thread "main" java.lang.IllegalArgumentException: Can not create a Path from an empty string
>  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:168)
>  at org.apache.hadoop.fs.Path.suffix(Path.java:446)
>  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions(InsertIntoHadoopFsRelationCommand.scala:240)
> {code}
> If you change out from val out = "s3a://peakhour-report" to val out = 
> "s3a://peakhour-report/folder", the code works.
> There are two problems in the actual code of 
> InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions:
> a) it uses the org.apache.hadoop.fs.Path.suffix method, which doesn't work on 
> root paths
> b) it tries to delete the root folder directly (in our case the s3 bucket 
> name), which is prohibited (in the S3AFileSystem class)
> I think that there are two choices:
> a) throw an explicit error when using overwrite mode for root folders
> b) fix the actual issue: don't use the Path.suffix method, and change the 
> clean-up code in InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to 
> list the root folder content and delete the entries one by one.
> I can provide a patch for both choices, assuming that they make sense.
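A sketch of choice b) under the stated assumption: instead of Path.suffix and a
delete of the root itself, list the root's children and delete them one by one
(requires valid s3a credentials in the Configuration):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val root = new Path("s3a://peakhour-report/")
val fs = root.getFileSystem(new Configuration())

// Deleting the bucket root directly is prohibited by S3AFileSystem,
// so remove each child entry recursively instead:
fs.listStatus(root).foreach(status => fs.delete(status.getPath, true))
{code}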



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34377) Support parquet datasource options to control datetime rebasing in read

2021-02-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279731#comment-17279731
 ] 

Apache Spark commented on SPARK-34377:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31489

> Support parquet datasource options to control datetime rebasing in read
> ---
>
> Key: SPARK-34377
> URL: https://issues.apache.org/jira/browse/SPARK-34377
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Add new parquet options similar to the SQL configs 
> {{spark.sql.legacy.parquet.datetimeRebaseModeInRead}} and 
> {{spark.sql.legacy.parquet.int96RebaseModeInRead}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34377) Support parquet datasource options to control datetime rebasing in read

2021-02-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34377:


Assignee: (was: Apache Spark)

> Support parquet datasource options to control datetime rebasing in read
> ---
>
> Key: SPARK-34377
> URL: https://issues.apache.org/jira/browse/SPARK-34377
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Add new parquet options similar to the SQL configs 
> {{spark.sql.legacy.parquet.datetimeRebaseModeInRead}} and 
> {{spark.sql.legacy.parquet.int96RebaseModeInRead}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34377) Support parquet datasource options to control datetime rebasing in read

2021-02-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34377:


Assignee: Apache Spark

> Support parquet datasource options to control datetime rebasing in read
> ---
>
> Key: SPARK-34377
> URL: https://issues.apache.org/jira/browse/SPARK-34377
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Add new parquet options similar to the SQL configs 
> {{spark.sql.legacy.parquet.datetimeRebaseModeInRead}} and 
> {{spark.sql.legacy.parquet.int96RebaseModeInRead}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34377) Support parquet datasource options to control datetime rebasing in read

2021-02-05 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-34377:
--

 Summary: Support parquet datasource options to control datetime 
rebasing in read
 Key: SPARK-34377
 URL: https://issues.apache.org/jira/browse/SPARK-34377
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.2.0
Reporter: Maxim Gekk


Add new parquet options similar to the SQL configs 
{{spark.sql.legacy.parquet.datetimeRebaseModeInRead}} and 
{{spark.sql.legacy.parquet.int96RebaseModeInRead}}.
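A hedged sketch of the proposed usage (assuming a SparkSession {{spark}}); the
datasource option names below mirror the SQL configs and are assumptions until
the PR settles them:

{code:scala}
val df = spark.read
  .option("datetimeRebaseMode", "CORRECTED") // LEGACY | CORRECTED | EXCEPTION
  .option("int96RebaseMode", "CORRECTED")
  .parquet("/tmp/old-parquet")
{code}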



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34350) replace withTimeZone defined in OracleIntegrationSuite with DateTimeTestUtils.withDefaultTimeZone

2021-02-05 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-34350.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/31465

> replace withTimeZone defined in OracleIntegrationSuite with 
> DateTimeTestUtils.withDefaultTimeZone
> -
>
> Key: SPARK-34350
> URL: https://issues.apache.org/jira/browse/SPARK-34350
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.2.0
>
>
> In OracleIntegrationSuite, a withTimeZone method is defined and used only there 
> to change the default timezone.
> On the other hand, a withDefaultTimeZone method is defined in DateTimeTestUtils 
> as a utility method, and it is semantically the same as withTimeZone.
> So it might be better to replace withTimeZone with withDefaultTimeZone.
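A minimal sketch of what such a shared helper does (the actual
DateTimeTestUtils.withDefaultTimeZone may differ in signature): swap the JVM
default time zone for the duration of the block, then restore it.

{code:scala}
import java.util.TimeZone

def withDefaultTimeZone[T](zoneId: String)(body: => T): T = {
  val saved = TimeZone.getDefault
  TimeZone.setDefault(TimeZone.getTimeZone(zoneId))
  try body finally TimeZone.setDefault(saved)
}

withDefaultTimeZone("America/Los_Angeles") {
  println(TimeZone.getDefault.getID) // America/Los_Angeles, restored afterwards
}
{code}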



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32985) Decouple bucket filter pruning and bucket table scan

2021-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32985.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31413
[https://github.com/apache/spark/pull/31413]

> Decouple bucket filter pruning and bucket table scan
> 
>
> Key: SPARK-32985
> URL: https://issues.apache.org/jira/browse/SPARK-32985
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.2.0
>
>
> As a followup from discussion in 
> [https://github.com/apache/spark/pull/29804#discussion_r493100510] . 
> Currently in data source v1 file scan `FileSourceScanExec`, bucket filter 
> pruning will only take effect with bucket table scan - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L542]
>  . However this is unnecessary, as bucket filter pruning can also happen if 
> we disable bucketed table scan. This helps queries leverage the benefit of 
> bucket filter pruning, saving CPU/IO by not reading unnecessary bucket files, 
> without being bound to bucketed table scan when the parallelism of tasks is a 
> concern.
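For illustration, a hedged sketch of the decoupling (assuming a SparkSession
{{spark}} and a table {{t}} bucketed by {{id}}): even with bucketed scan
disabled, a filter on the bucket column should still be able to prune bucket
files.

{code:scala}
import org.apache.spark.sql.functions.col

// Disable bucketed table scan (existing config):
spark.conf.set("spark.sql.sources.bucketing.enabled", "false")

// Bucket filter pruning on `id` remains desirable even for this scan:
spark.table("t").where(col("id") === 42).explain()
{code}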



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32985) Decouple bucket filter pruning and bucket table scan

2021-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32985:
---

Assignee: Cheng Su

> Decouple bucket filter pruning and bucket table scan
> 
>
> Key: SPARK-32985
> URL: https://issues.apache.org/jira/browse/SPARK-32985
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
>
> As a followup from discussion in 
> [https://github.com/apache/spark/pull/29804#discussion_r493100510] . 
> Currently in data source v1 file scan `FileSourceScanExec`, bucket filter 
> pruning will only take effect with bucket table scan - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L542]
>  . However this is unnecessary, as bucket filter pruning can also happen if 
> we disable bucketed table scan. This helps queries leverage the benefit of 
> bucket filter pruning, saving CPU/IO by not reading unnecessary bucket files, 
> without being bound to bucketed table scan when the parallelism of tasks is a 
> concern.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34359) add a legacy config to restore the output schema of SHOW DATABASES

2021-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-34359:

Fix Version/s: (was: 3.2.0)
   3.1.1
   3.0.2

> add a legacy config to restore the output schema of SHOW DATABASES
> --
>
> Key: SPARK-34359
> URL: https://issues.apache.org/jira/browse/SPARK-34359
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.2, 3.1.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32698) Do not fall back to default parallelism if the minimum number of coalesced partitions is not set in AQE

2021-02-05 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang resolved SPARK-32698.

Resolution: Won't Do

> Do not fall back to default parallelism if the minimum number of coalesced 
> partitions is not set in AQE
> ---
>
> Key: SPARK-32698
> URL: https://issues.apache.org/jira/browse/SPARK-32698
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Minor
>
> Currently in AQE when coalescing shuffling partitions,
> {quote}We fall back to Spark default parallelism if the minimum number of 
> coalesced partitions is not set, so to avoid perf regressions compared to no 
> coalescing.
> {quote}
> From our experience, this has resulted in a lot of uncertainty in the number 
> of tasks after coalescing, especially with dynamic allocation, and has also led 
> to many small output files. It's complex and hard to reason about.
> Hence, I'm proposing not falling back to the default parallelism but 
> coalescing towards the target size when the minimum number of coalesced 
> partitions is not set.
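The configs in play, for reference (values are illustrative; assuming a
SparkSession {{spark}}). When the minimum number of coalesced partitions is not
set, current behavior falls back to default parallelism; the proposal was to
coalesce purely toward the advisory target size.

{code:scala}
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")
// The config whose absence triggers the fallback discussed above:
// spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionNum", "200")
{code}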



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34309) Use Caffeine instead of Guava Cache

2021-02-05 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279623#comment-17279623
 ] 

Yang Jie edited comment on SPARK-34309 at 2/5/21, 10:44 AM:


Caffeine is different from Guava cache in its maximum-size mechanism:
 * Guava Cache: note that the cache may evict an entry before this limit is 
exceeded.

 * Caffeine: note that the cache may evict an entry before this limit is 
exceeded, or {color:#FF0000}temporarily exceed the threshold while 
evicting{color}.

So Caffeine may have a small eviction delay, maybe 5 ms.


was (Author: luciferyang):
Caffeine is different from guava cache in maximum size mechanism:

* Guava Cache: Note that the cache may evict an entry before this limit is 
exceeded

* Caffeine: Note that the cache may evict an entry before this limit is 
exceeded or temporarily exceed the threshold while evicting. 

So Caffeine may have a little eviction delay, maybe 5ms


> Use Caffeine instead of Guava Cache
> ---
>
> Key: SPARK-34309
> URL: https://issues.apache.org/jira/browse/SPARK-34309
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: image-2021-02-05-18-08-48-852.png, screenshot-1.png
>
>
> Caffeine is a high-performance, near-optimal caching library based on Java 8. 
> It is used in a similar way to Guava cache, but with better performance; 
> comparison results are available on the [caffeine 
> benchmarks|https://github.com/ben-manes/caffeine/wiki/Benchmarks] page.
> At the same time, Caffeine is already used in open source projects such as 
> Cassandra, HBase, Neo4j, Druid, Spring, and so on.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34309) Use Caffeine instead of Guava Cache

2021-02-05 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279548#comment-17279548
 ] 

Yang Jie edited comment on SPARK-34309 at 2/5/21, 10:45 AM:


[~dongjoon] I use the code in 
[GetPutBenchmark|https://github.com/ben-manes/caffeine/blob/master/caffeine/src/jmh/java/com/github/benmanes/caffeine/cache/GetPutBenchmark.java]
 to compare the performance of caffeine, guava cache, LinkedHashMap and 
ConcurrentLinkedHashMap, the result as follows:
 * *read write thrpt test (6 threads read and 2 threads write):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|11248568.197 ± 12737170.642 ops/s|
|Caffeine|44638046.442 ± 23455184.501 ops/s|
|ConcurrentLinkedHashMap|30133621.108 ± 21890181.254 ops/s|
|Guava|1486.653 ± 5716763.921 ops/s|
 * *read only thrpt test (8 threads read):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|10976457.064 ± 26703667.288 ops/s|
|Caffeine|62507109.307 ± 72055321.581 ops/s|
|ConcurrentLinkedHashMap|29978890.384 ± 22266401.711 ops/s|
|Guava|15102099.448 ± 6613818.000 ops/s|
 * *write only thrpt test (8 threads write):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|9776881.873 ± 25659900.079 ops/s|
|Caffeine|27899813.470 ± 11399461.937 ops/s|
|ConcurrentLinkedHashMap|15839475.472 ± 18312421.561 ops/s|
|Guava|7675161.183 ± 6730863.169 ops/s|

*{color:#FF0000}It seems that the performance of Caffeine is better than that 
of Guava cache{color}*


was (Author: luciferyang):
[~dongjoon] I use the code in 
[GetPutBenchmark|https://github.com/ben-manes/caffeine/blob/master/caffeine/src/jmh/java/com/github/benmanes/caffeine/cache/GetPutBenchmark.java]
 to compare the performance of caffeine, guava cache, LinkedHashMap and 
ConcurrentLinkedHashMap, the result as follows:

** read write thrpt test (6 threads read and 2 threads write):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|11248568.197 ± 12737170.642  ops/s|
|Caffeine|44638046.442 ± 23455184.501  ops/s|
|ConcurrentLinkedHashMap|30133621.108 ± 21890181.254  ops/s|
|Guava|1486.653 ±  5716763.921  ops/s|

*  *read only thrpt test(8 threads read):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|10976457.064 ± 26703667.288  ops/s|
|Caffeine|62507109.307 ± 72055321.581  ops/s|
|ConcurrentLinkedHashMap|29978890.384 ± 22266401.711  ops/s|
|Guava|15102099.448 ±  6613818.000  ops/s|

*  *write only thrpt test(8 threads write):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|9776881.873 ± 25659900.079  ops/s|
|Caffeine|27899813.470 ± 11399461.937  ops/s|
|ConcurrentLinkedHashMap|15839475.472 ± 18312421.561  ops/s|
|Guava|7675161.183 ±  6730863.169  ops/s|

*{color:red}It seems that the performance of caffeine is better than that of 
guava cache{color}*



> Use Caffeine instead of Guava Cache
> ---
>
> Key: SPARK-34309
> URL: https://issues.apache.org/jira/browse/SPARK-34309
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: image-2021-02-05-18-08-48-852.png, screenshot-1.png
>
>
> Caffeine is a high-performance, near-optimal caching library based on Java 8. 
> It is used in a similar way to Guava cache, but with better performance; 
> comparison results are available on the [caffeine 
> benchmarks|https://github.com/ben-manes/caffeine/wiki/Benchmarks] page.
> At the same time, Caffeine is already used in open source projects such as 
> Cassandra, HBase, Neo4j, Druid, Spring, and so on.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34309) Use Caffeine instead of Guava Cache

2021-02-05 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279623#comment-17279623
 ] 

Yang Jie commented on SPARK-34309:
--

Caffeine is different from Guava cache in its maximum-size mechanism:

 * Guava Cache: note that the cache may evict an entry before this limit is 
exceeded.

 * Caffeine: note that the cache may evict an entry before this limit is 
exceeded or temporarily exceed the threshold while evicting.

So Caffeine may have a small eviction delay, maybe 5 ms.
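For reference, a hedged sketch of how close the two APIs are in practice (Scala
syntax; Caffeine's builder deliberately mirrors Guava's, so migration is nearly
drop-in):

{code:scala}
import com.github.benmanes.caffeine.cache.Caffeine
import com.google.common.cache.CacheBuilder

val guavaCache = CacheBuilder.newBuilder()
  .maximumSize(10000L)
  .build[String, String]()

val caffeineCache = Caffeine.newBuilder()
  .maximumSize(10000L)
  .build[String, String]()

caffeineCache.put("k", "v")
assert(caffeineCache.getIfPresent("k") == "v")
{code}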


> Use Caffeine instead of Guava Cache
> ---
>
> Key: SPARK-34309
> URL: https://issues.apache.org/jira/browse/SPARK-34309
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: image-2021-02-05-18-08-48-852.png, screenshot-1.png
>
>
> Caffeine is a high-performance, near-optimal caching library based on Java 8. 
> It is used in a similar way to Guava cache, but with better performance; 
> comparison results are available on the [caffeine 
> benchmarks|https://github.com/ben-manes/caffeine/wiki/Benchmarks] page.
> At the same time, Caffeine is already used in open source projects such as 
> Cassandra, HBase, Neo4j, Druid, Spring, and so on.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34376) Support regexp as a function

2021-02-05 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34376:
-
Description: 
We have equality in SqlBase.g4 for RLIKE: 'RLIKE' | 'REGEXP';
We seem to have missed adding REGEXP as a SQL function, just like RLIKE

This is also registered in Hive as a function, so we can reduce the migration pain 
for those users


  was:
We have equality in SqlBase.g4 for RLIKE: 'RLIKE' | 'REGEXP';
We seemed to miss adding REGEXP as a SQL function just like RLIKE

This is also registered in Hive as a function, we can reduce the migration plan 
for those users



> Support regexp as a function
> 
>
> Key: SPARK-34376
> URL: https://issues.apache.org/jira/browse/SPARK-34376
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> We have equality in SqlBase.g4 for RLIKE: 'RLIKE' | 'REGEXP';
> We seem to have missed adding REGEXP as a SQL function, just like RLIKE
> This is also registered in Hive as a function, so we can reduce the migration 
> pain for those users
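For illustration (assuming a SparkSession {{spark}}): both operator spellings
parse today, while the function form is the gap this ticket proposes to fill.

{code:scala}
spark.sql("SELECT 'hello' RLIKE 'h.*'").show()  // works today
spark.sql("SELECT 'hello' REGEXP 'h.*'").show() // works today (same token)
// Proposed function form, mirroring Hive (not available yet):
// spark.sql("SELECT regexp('hello', 'h.*')").show()
{code}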



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34309) Use Caffeine instead of Guava Cache

2021-02-05 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279548#comment-17279548
 ] 

Yang Jie edited comment on SPARK-34309 at 2/5/21, 10:20 AM:


[~dongjoon] I use the code in 
[GetPutBenchmark|https://github.com/ben-manes/caffeine/blob/master/caffeine/src/jmh/java/com/github/benmanes/caffeine/cache/GetPutBenchmark.java]
 to compare the performance of caffeine, guava cache, LinkedHashMap and 
ConcurrentLinkedHashMap, the result as follows:

** read write thrpt test (6 threads read and 2 threads write):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|11248568.197 ± 12737170.642  ops/s|
|Caffeine|44638046.442 ± 23455184.501  ops/s|
|ConcurrentLinkedHashMap|30133621.108 ± 21890181.254  ops/s|
|Guava|1486.653 ±  5716763.921  ops/s|

*  *read only thrpt test(8 threads read):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|10976457.064 ± 26703667.288  ops/s|
|Caffeine|62507109.307 ± 72055321.581  ops/s|
|ConcurrentLinkedHashMap|29978890.384 ± 22266401.711  ops/s|
|Guava|15102099.448 ±  6613818.000  ops/s|

*  *write only thrpt test(8 threads write):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|9776881.873 ± 25659900.079  ops/s|
|Caffeine|27899813.470 ± 11399461.937  ops/s|
|ConcurrentLinkedHashMap|15839475.472 ± 18312421.561  ops/s|
|Guava|7675161.183 ±  6730863.169  ops/s|

*{color:red}It seems that the performance of caffeine is better than that of 
guava cache{color}*




was (Author: luciferyang):
[~dongjoon] I use the code in 
[GetPutBenchmark|https://github.com/ben-manes/caffeine/blob/master/caffeine/src/jmh/java/com/github/benmanes/caffeine/cache/GetPutBenchmark.java]
 to compare the performance of caffeine, guava cache, LinkedHashMap and 
ConcurrentLinkedHashMap, the result as follows:

** read write thrpt test (6 threads read and 2 threads write):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|11248568.197 ± 12737170.642  ops/s|
|Caffeine|44638046.442 ± 23455184.501  ops/s|
|ConcurrentLinkedHashMap|30133621.108 ± 21890181.254  ops/s|
|Guava|1486.653 ±  5716763.921  ops/s|

*  *read only thrpt test(8 threads read):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|10976457.064 ± 26703667.288  ops/s|
|Caffeine|62507109.307 ± 72055321.581  ops/s|
|ConcurrentLinkedHashMap|29978890.384 ± 22266401.711  ops/s|
|Guava|15102099.448 ±  6613818.000  ops/s|

*  *write only thrpt test(8 threads write):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|9776881.873 ± 25659900.079  ops/s|
|Caffeine|27899813.470 ± 11399461.937  ops/s|
|ConcurrentLinkedHashMap|15839475.472 ± 18312421.561  ops/s|
|Guava|7675161.183 ±  6730863.169  ops/s|

It seems that the performance of caffeine is better than that of guava cache



> Use Caffeine instead of Guava Cache
> ---
>
> Key: SPARK-34309
> URL: https://issues.apache.org/jira/browse/SPARK-34309
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: image-2021-02-05-18-08-48-852.png, screenshot-1.png
>
>
> Caffeine is a high-performance, near-optimal caching library based on Java 8. 
> It is used in a similar way to Guava cache, but with better performance; 
> comparison results are available on the [caffeine 
> benchmarks|https://github.com/ben-manes/caffeine/wiki/Benchmarks] page.
> At the same time, Caffeine is already used in open source projects such as 
> Cassandra, HBase, Neo4j, Druid, Spring, and so on.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34309) Use Caffeine instead of Guava Cache

2021-02-05 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279548#comment-17279548
 ] 

Yang Jie edited comment on SPARK-34309 at 2/5/21, 10:19 AM:


[~dongjoon] I use the code in 
[GetPutBenchmark|https://github.com/ben-manes/caffeine/blob/master/caffeine/src/jmh/java/com/github/benmanes/caffeine/cache/GetPutBenchmark.java]
 to compare the performance of caffeine, guava cache, LinkedHashMap and 
ConcurrentLinkedHashMap, the result as follows:

** read write thrpt test (6 threads read and 2 threads write):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|11248568.197 ± 12737170.642  ops/s|
|Caffeine|44638046.442 ± 23455184.501  ops/s|
|ConcurrentLinkedHashMap|30133621.108 ± 21890181.254  ops/s|
|Guava|1486.653 ±  5716763.921  ops/s|

*  *read only thrpt test(8 threads read):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|10976457.064 ± 26703667.288  ops/s|
|Caffeine|62507109.307 ± 72055321.581  ops/s|
|ConcurrentLinkedHashMap|29978890.384 ± 22266401.711  ops/s|
|Guava|15102099.448 ±  6613818.000  ops/s|

*  *write only thrpt test(8 threads write):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|9776881.873 ± 25659900.079  ops/s|
|Caffeine|27899813.470 ± 11399461.937  ops/s|
|ConcurrentLinkedHashMap|15839475.472 ± 18312421.561  ops/s|
|Guava|7675161.183 ±  6730863.169  ops/s|



was (Author: luciferyang):
[~dongjoon] I use the code in 
[GetPutBenchmark|https://github.com/ben-manes/caffeine/blob/master/caffeine/src/jmh/java/com/github/benmanes/caffeine/cache/GetPutBenchmark.java]
 to compare the performance of caffeine, guava cache, LinkedHashMap and 
ConcurrentLinkedHashMap, the result as follows:

** read write thrpt test (6 threads read and 2 threads write):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|11248568.197 ± 12737170.642  ops/s|
|Caffeine|44638046.442 ± 23455184.501  ops/s|
|ConcurrentLinkedHashMap|30133621.108 ± 21890181.254  ops/s|
|Guava|1486.653 ±  5716763.921  ops/s|

*  *read only thrpt test(8 threads read):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|10976457.064 ± 26703667.288  ops/s|
|Caffeine|62507109.307 ± 72055321.581  ops/s|
|ConcurrentLinkedHashMap|29978890.384 ± 22266401.711  ops/s|
|Guava|15102099.448 ±  6613818.000  ops/s|

*  *write only thrpt test(8 threads write):*
||cacheType||Score(Units)||
|LinkedHashMap_Lru|9776881.873 ± 25659900.079  ops/s|
|Caffeine|27899813.470 ± 11399461.937  ops/s|
|ConcurrentLinkedHashMap|15839475.472 ± 18312421.561  ops/s|
|Guava|7675161.183 ±  6730863.169  ops/s|


> Use Caffeine instead of Guava Cache
> ---
>
> Key: SPARK-34309
> URL: https://issues.apache.org/jira/browse/SPARK-34309
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: image-2021-02-05-18-08-48-852.png, screenshot-1.png
>
>
> Caffeine is a high-performance, near-optimal caching library based on Java 8. 
> It is used in a similar way to Guava cache, but with better performance; 
> comparison results are available on the [caffeine 
> benchmarks|https://github.com/ben-manes/caffeine/wiki/Benchmarks] page.
> At the same time, Caffeine is already used in open source projects such as 
> Cassandra, HBase, Neo4j, Druid, Spring, and so on.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34309) Use Caffeine instead of Guava Cache

2021-02-05 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279548#comment-17279548
 ] 

Yang Jie edited comment on SPARK-34309 at 2/5/21, 10:13 AM:


[~dongjoon] I use the code in 
[GetPutBenchmark|https://github.com/ben-manes/caffeine/blob/master/caffeine/src/jmh/java/com/github/benmanes/caffeine/cache/GetPutBenchmark.java]
 to compare the performance of caffeine, guava cache, LinkedHashMap and 
ConcurrentLinkedHashMap, the result as follows:

# read write (6 threads read and 2 threads write):

 !screenshot-1.png! 

# read only (8 threads read):

 !image-2021-02-05-18-08-48-852.png! 

# write only (8 threads write):



was (Author: luciferyang):
[~dongjoon] I use the code in 
[GetPutBenchmark|https://github.com/ben-manes/caffeine/blob/master/caffeine/src/jmh/java/com/github/benmanes/caffeine/cache/GetPutBenchmark.java]
 to compare the performance of caffeine, guava cache, LinkedHashMap and 
ConcurrentLinkedHashMap, the result as follows:

# read write (6 read and 2 write):

 !screenshot-1.png! 

# read only

 !image-2021-02-05-18-08-48-852.png! 

> Use Caffeine instead of Guava Cache
> ---
>
> Key: SPARK-34309
> URL: https://issues.apache.org/jira/browse/SPARK-34309
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: image-2021-02-05-18-08-48-852.png, screenshot-1.png
>
>
> Caffeine is a high-performance, near-optimal caching library based on Java 8. 
> It is used in a similar way to Guava cache, but with better performance; 
> comparison results are available on the [caffeine 
> benchmarks|https://github.com/ben-manes/caffeine/wiki/Benchmarks] page.
> At the same time, Caffeine is already used in open source projects such as 
> Cassandra, HBase, Neo4j, Druid, Spring, and so on.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34309) Use Caffeine instead of Guava Cache

2021-02-05 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279548#comment-17279548
 ] 

Yang Jie edited comment on SPARK-34309 at 2/5/21, 10:16 AM:


[~dongjoon] I used the code in 
[GetPutBenchmark|https://github.com/ben-manes/caffeine/blob/master/caffeine/src/jmh/java/com/github/benmanes/caffeine/cache/GetPutBenchmark.java]
 to compare the performance of Caffeine, Guava Cache, LinkedHashMap, and 
ConcurrentLinkedHashMap; the results are as follows:

# read write thrpt test (6 threads read and 2 threads write):

||cacheType||Score(Units)||
|LinkedHashMap_Lru|11248568.197 ± 12737170.642  ops/s|
|Caffeine|44638046.442 ± 23455184.501  ops/s|
|ConcurrentLinkedHashMap|30133621.108 ± 21890181.254  ops/s|
|Guava|1486.653 ±  5716763.921  ops/s|

# read only (8 threads read):

 !image-2021-02-05-18-08-48-852.png! 

# write only (8 threads write):



was (Author: luciferyang):
[~dongjoon] I used the code in 
[GetPutBenchmark|https://github.com/ben-manes/caffeine/blob/master/caffeine/src/jmh/java/com/github/benmanes/caffeine/cache/GetPutBenchmark.java]
 to compare the performance of Caffeine, Guava Cache, LinkedHashMap, and 
ConcurrentLinkedHashMap; the results are as follows:

# read write (6 threads read and 2 threads write):

||cacheType||Score(Units)||
|LinkedHashMap_Lru|11248568.197 ± 12737170.642  ops/s|
|Caffeine|44638046.442 ± 23455184.501  ops/s|
|ConcurrentLinkedHashMap|30133621.108 ± 21890181.254  ops/s|
|Guava|1486.653 ±  5716763.921  ops/s|

# read only (8 threads read):

 !image-2021-02-05-18-08-48-852.png! 

# write only (8 threads write):


> Use Caffeine instead of Guava Cache
> ---
>
> Key: SPARK-34309
> URL: https://issues.apache.org/jira/browse/SPARK-34309
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: image-2021-02-05-18-08-48-852.png, screenshot-1.png
>
>
> Caffeine is a high-performance, near-optimal caching library based on Java 8. 
> It is used in much the same way as Guava Cache, but with better performance; 
> comparison results are available on the 
> [caffeine benchmarks|https://github.com/ben-manes/caffeine/wiki/Benchmarks] page.
> Caffeine is already used in open source projects such as Cassandra, HBase, 
> Neo4j, Druid, and Spring.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34309) Use Caffeine instead of Guava Cache

2021-02-05 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279548#comment-17279548
 ] 

Yang Jie edited comment on SPARK-34309 at 2/5/21, 10:20 AM:


[~dongjoon] I used the code in 
[GetPutBenchmark|https://github.com/ben-manes/caffeine/blob/master/caffeine/src/jmh/java/com/github/benmanes/caffeine/cache/GetPutBenchmark.java]
 to compare the performance of Caffeine, Guava Cache, LinkedHashMap, and 
ConcurrentLinkedHashMap; the results are as follows:

* *read write thrpt test (6 threads read and 2 threads write):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|11248568.197 ± 12737170.642  ops/s|
|Caffeine|44638046.442 ± 23455184.501  ops/s|
|ConcurrentLinkedHashMap|30133621.108 ± 21890181.254  ops/s|
|Guava|1486.653 ±  5716763.921  ops/s|

* *read only thrpt test (8 threads read):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|10976457.064 ± 26703667.288  ops/s|
|Caffeine|62507109.307 ± 72055321.581  ops/s|
|ConcurrentLinkedHashMap|29978890.384 ± 22266401.711  ops/s|
|Guava|15102099.448 ±  6613818.000  ops/s|

* *write only thrpt test (8 threads write):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|9776881.873 ± 25659900.079  ops/s|
|Caffeine|27899813.470 ± 11399461.937  ops/s|
|ConcurrentLinkedHashMap|15839475.472 ± 18312421.561  ops/s|
|Guava|7675161.183 ±  6730863.169  ops/s|

It seems that the performance of Caffeine is better than that of Guava Cache in all three tests.
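For anyone reproducing these numbers, a stripped-down sketch (not the actual
GetPutBenchmark code; cache size and key choice are illustrative) of how a
6-reader/2-writer throughput run can be expressed with JMH thread groups:

{code:scala}
import java.util.concurrent.TimeUnit
import com.github.benmanes.caffeine.cache.{Cache, Caffeine}
import org.openjdk.jmh.annotations._

@State(Scope.Group)
@BenchmarkMode(Array(Mode.Throughput))
@OutputTimeUnit(TimeUnit.SECONDS)
class ReadWriteBench {
  var cache: Cache[Integer, Integer] = _

  @Setup
  def setup(): Unit = {
    // Pre-populate so reads mostly hit, as in a steady-state cache.
    cache = Caffeine.newBuilder().maximumSize(65536L).build[Integer, Integer]()
    (0 until 65536).foreach(i => cache.put(i, i))
  }

  // Six threads read while two threads write, matching the 6/2 split above.
  @Benchmark @Group("readWrite") @GroupThreads(6)
  def read(): Integer = cache.getIfPresent(42)

  @Benchmark @Group("readWrite") @GroupThreads(2)
  def write(): Unit = cache.put(42, 42)
}
{code}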




was (Author: luciferyang):
[~dongjoon] I used the code in 
[GetPutBenchmark|https://github.com/ben-manes/caffeine/blob/master/caffeine/src/jmh/java/com/github/benmanes/caffeine/cache/GetPutBenchmark.java]
 to compare the performance of Caffeine, Guava Cache, LinkedHashMap, and 
ConcurrentLinkedHashMap; the results are as follows:

* *read write thrpt test (6 threads read and 2 threads write):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|11248568.197 ± 12737170.642  ops/s|
|Caffeine|44638046.442 ± 23455184.501  ops/s|
|ConcurrentLinkedHashMap|30133621.108 ± 21890181.254  ops/s|
|Guava|1486.653 ±  5716763.921  ops/s|

* *read only thrpt test (8 threads read):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|10976457.064 ± 26703667.288  ops/s|
|Caffeine|62507109.307 ± 72055321.581  ops/s|
|ConcurrentLinkedHashMap|29978890.384 ± 22266401.711  ops/s|
|Guava|15102099.448 ±  6613818.000  ops/s|

* *write only thrpt test (8 threads write):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|9776881.873 ± 25659900.079  ops/s|
|Caffeine|27899813.470 ± 11399461.937  ops/s|
|ConcurrentLinkedHashMap|15839475.472 ± 18312421.561  ops/s|
|Guava|7675161.183 ±  6730863.169  ops/s|


> Use Caffeine instead of Guava Cache
> ---
>
> Key: SPARK-34309
> URL: https://issues.apache.org/jira/browse/SPARK-34309
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: image-2021-02-05-18-08-48-852.png, screenshot-1.png
>
>
> Caffeine is a high-performance, near-optimal caching library based on Java 8. 
> It is used in much the same way as Guava Cache, but with better performance; 
> comparison results are available on the 
> [caffeine benchmarks|https://github.com/ben-manes/caffeine/wiki/Benchmarks] page.
> Caffeine is already used in open source projects such as Cassandra, HBase, 
> Neo4j, Druid, and Spring.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34309) Use Caffeine instead of Guava Cache

2021-02-05 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279548#comment-17279548
 ] 

Yang Jie edited comment on SPARK-34309 at 2/5/21, 10:19 AM:


[~dongjoon] I used the code in 
[GetPutBenchmark|https://github.com/ben-manes/caffeine/blob/master/caffeine/src/jmh/java/com/github/benmanes/caffeine/cache/GetPutBenchmark.java]
 to compare the performance of Caffeine, Guava Cache, LinkedHashMap, and 
ConcurrentLinkedHashMap; the results are as follows:

* *read write thrpt test (6 threads read and 2 threads write):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|11248568.197 ± 12737170.642  ops/s|
|Caffeine|44638046.442 ± 23455184.501  ops/s|
|ConcurrentLinkedHashMap|30133621.108 ± 21890181.254  ops/s|
|Guava|1486.653 ±  5716763.921  ops/s|

* *read only thrpt test (8 threads read):*

||cacheType||Score(Units)||
|LinkedHashMap_Lru|10976457.064 ± 26703667.288  ops/s|
|Caffeine|62507109.307 ± 72055321.581  ops/s|
|ConcurrentLinkedHashMap|29978890.384 ± 22266401.711  ops/s|
|Guava|15102099.448 ±  6613818.000  ops/s|

* *write only thrpt test (8 threads write):*
||cacheType||Score(Units)||
|LinkedHashMap_Lru|9776881.873 ± 25659900.079  ops/s|
|Caffeine|27899813.470 ± 11399461.937  ops/s|
|ConcurrentLinkedHashMap|15839475.472 ± 18312421.561  ops/s|
|Guava|7675161.183 ±  6730863.169  ops/s|



was (Author: luciferyang):
[~dongjoon] I used the code in 
[GetPutBenchmark|https://github.com/ben-manes/caffeine/blob/master/caffeine/src/jmh/java/com/github/benmanes/caffeine/cache/GetPutBenchmark.java]
 to compare the performance of Caffeine, Guava Cache, LinkedHashMap, and 
ConcurrentLinkedHashMap; the results are as follows:

# read write thrpt test (6 threads read and 2 threads write):

||cacheType||Score(Units)||
|LinkedHashMap_Lru|11248568.197 ± 12737170.642  ops/s|
|Caffeine|44638046.442 ± 23455184.501  ops/s|
|ConcurrentLinkedHashMap|30133621.108 ± 21890181.254  ops/s|
|Guava|1486.653 ±  5716763.921  ops/s|

# read only (8 threads read):

 !image-2021-02-05-18-08-48-852.png! 

# write only (8 threads write):


> Use Caffeine instead of Guava Cache
> ---
>
> Key: SPARK-34309
> URL: https://issues.apache.org/jira/browse/SPARK-34309
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: image-2021-02-05-18-08-48-852.png, screenshot-1.png
>
>
> Caffeine is a high-performance, near-optimal caching library based on Java 8. 
> It is used in much the same way as Guava Cache, but with better performance; 
> comparison results are available on the 
> [caffeine benchmarks|https://github.com/ben-manes/caffeine/wiki/Benchmarks] page.
> Caffeine is already used in open source projects such as Cassandra, HBase, 
> Neo4j, Druid, and Spring.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34309) Use Caffeine instead of Guava Cache

2021-02-05 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279548#comment-17279548
 ] 

Yang Jie edited comment on SPARK-34309 at 2/5/21, 10:15 AM:


[~dongjoon] I used the code in 
[GetPutBenchmark|https://github.com/ben-manes/caffeine/blob/master/caffeine/src/jmh/java/com/github/benmanes/caffeine/cache/GetPutBenchmark.java]
 to compare the performance of Caffeine, Guava Cache, LinkedHashMap, and 
ConcurrentLinkedHashMap; the results are as follows:

# read write (6 threads read and 2 threads write):

||cacheType||Score(Units)||
|LinkedHashMap_Lru|11248568.197 ± 12737170.642  ops/s|
|Caffeine|44638046.442 ± 23455184.501  ops/s|
|ConcurrentLinkedHashMap|30133621.108 ± 21890181.254  ops/s|
|Guava|1486.653 ±  5716763.921  ops/s|

# read only (8 threads read):

 !image-2021-02-05-18-08-48-852.png! 

# write only (8 threads write):



was (Author: luciferyang):
[~dongjoon] I used the code in 
[GetPutBenchmark|https://github.com/ben-manes/caffeine/blob/master/caffeine/src/jmh/java/com/github/benmanes/caffeine/cache/GetPutBenchmark.java]
 to compare the performance of Caffeine, Guava Cache, LinkedHashMap, and 
ConcurrentLinkedHashMap; the results are as follows:

# read write (6 threads read and 2 threads write):

 !screenshot-1.png! 

# read only (8 threads read):

 !image-2021-02-05-18-08-48-852.png! 

# write only (8 threads write):


> Use Caffeine instead of Guava Cache
> ---
>
> Key: SPARK-34309
> URL: https://issues.apache.org/jira/browse/SPARK-34309
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: image-2021-02-05-18-08-48-852.png, screenshot-1.png
>
>
> Caffeine is a high-performance, near-optimal caching library based on Java 8. 
> It is used in much the same way as Guava Cache, but with better performance; 
> comparison results are available on the 
> [caffeine benchmarks|https://github.com/ben-manes/caffeine/wiki/Benchmarks] page.
> Caffeine is already used in open source projects such as Cassandra, HBase, 
> Neo4j, Druid, and Spring.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34309) Use Caffeine instead of Guava Cache

2021-02-05 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279548#comment-17279548
 ] 

Yang Jie edited comment on SPARK-34309 at 2/5/21, 10:08 AM:


[~dongjoon] I used the code in 
[GetPutBenchmark|https://github.com/ben-manes/caffeine/blob/master/caffeine/src/jmh/java/com/github/benmanes/caffeine/cache/GetPutBenchmark.java]
 to compare the performance of Caffeine, Guava Cache, LinkedHashMap, and 
ConcurrentLinkedHashMap; the results are as follows:

# read write (6 read and 2 write):

 !screenshot-1.png! 

# read only

 !image-2021-02-05-18-08-48-852.png! 


was (Author: luciferyang):
[~dongjoon] I used the code in 
[GetPutBenchmark|https://github.com/ben-manes/caffeine/blob/master/caffeine/src/jmh/java/com/github/benmanes/caffeine/cache/GetPutBenchmark.java]
 to compare the performance of Caffeine, Guava Cache, LinkedHashMap, and 
ConcurrentLinkedHashMap; the results are as follows:

# read write (6 read and 2 write):

 !screenshot-1.png! 

# read only



> Use Caffeine instead of Guava Cache
> ---
>
> Key: SPARK-34309
> URL: https://issues.apache.org/jira/browse/SPARK-34309
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: image-2021-02-05-18-08-48-852.png, screenshot-1.png
>
>
> Caffeine is a high-performance, near-optimal caching library based on Java 8. 
> It is used in much the same way as Guava Cache, but with better performance; 
> comparison results are available on the 
> [caffeine benchmarks|https://github.com/ben-manes/caffeine/wiki/Benchmarks] page.
> Caffeine is already used in open source projects such as Cassandra, HBase, 
> Neo4j, Druid, and Spring.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34309) Use Caffeine instead of Guava Cache

2021-02-05 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279548#comment-17279548
 ] 

Yang Jie commented on SPARK-34309:
--

[~dongjoon] I used the code in 
[GetPutBenchmark|https://github.com/ben-manes/caffeine/blob/master/caffeine/src/jmh/java/com/github/benmanes/caffeine/cache/GetPutBenchmark.java]
 to compare the performance of Caffeine, Guava Cache, LinkedHashMap, and 
ConcurrentLinkedHashMap; the results are as follows:

# read write (6 read and 2 write):

 !screenshot-1.png! 

# read only



> Use Caffeine instead of Guava Cache
> ---
>
> Key: SPARK-34309
> URL: https://issues.apache.org/jira/browse/SPARK-34309
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> Caffeine is a high-performance, near-optimal caching library based on Java 8. 
> It is used in much the same way as Guava Cache, but with better performance; 
> comparison results are available on the 
> [caffeine benchmarks|https://github.com/ben-manes/caffeine/wiki/Benchmarks] page.
> Caffeine is already used in open source projects such as Cassandra, HBase, 
> Neo4j, Druid, and Spring.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34375) Replaces `Mockito.initMocks` with `Mockito.openMocks`

2021-02-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34375:


Assignee: Apache Spark

> Replaces `Mockito.initMocks` with `Mockito.openMocks`
> -
>
> Key: SPARK-34375
> URL: https://issues.apache.org/jira/browse/SPARK-34375
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core, Tests
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> Mockito.initMocks is a deprecated API; openMocks(Object) should be used instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34375) Replaces `Mockito.initMocks` with `Mockito.openMocks`

2021-02-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34375:


Assignee: (was: Apache Spark)

> Replaces `Mockito.initMocks` with `Mockito.openMocks`
> -
>
> Key: SPARK-34375
> URL: https://issues.apache.org/jira/browse/SPARK-34375
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core, Tests
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> Mockito.initMocks is a deprecated API; openMocks(Object) should be used instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34375) Replaces `Mockito.initMocks` with `Mockito.openMocks`

2021-02-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279491#comment-17279491
 ] 

Apache Spark commented on SPARK-34375:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/31487

> Replaces `Mockito.initMocks` with `Mockito.openMocks`
> -
>
> Key: SPARK-34375
> URL: https://issues.apache.org/jira/browse/SPARK-34375
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core, Tests
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> Mockito.initMocks is a deprecated API; openMocks(Object) should be used instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34309) Use Caffeine instead of Guava Cache

2021-02-05 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-34309:
-
Attachment: screenshot-1.png

> Use Caffeine instead of Guava Cache
> ---
>
> Key: SPARK-34309
> URL: https://issues.apache.org/jira/browse/SPARK-34309
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> Caffeine is a high-performance, near-optimal caching library based on Java 8. 
> It is used in much the same way as Guava Cache, but with better performance; 
> comparison results are available on the 
> [caffeine benchmarks|https://github.com/ben-manes/caffeine/wiki/Benchmarks] page.
> Caffeine is already used in open source projects such as Cassandra, HBase, 
> Neo4j, Druid, and Spring.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34359) add a legacy config to restore the output schema of SHOW DATABASES

2021-02-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279447#comment-17279447
 ] 

Apache Spark commented on SPARK-34359:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/31486

> add a legacy config to restore the output schema of SHOW DATABASES
> --
>
> Key: SPARK-34359
> URL: https://issues.apache.org/jira/browse/SPARK-34359
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34376) Support regexp as a function

2021-02-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34376:


Assignee: (was: Apache Spark)

> Support regexp as a function
> 
>
> Key: SPARK-34376
> URL: https://issues.apache.org/jira/browse/SPARK-34376
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> SqlBase.g4 already treats RLIKE and REGEXP as equivalent keywords: 'RLIKE' | 'REGEXP';
> however, we seem to have missed registering REGEXP as a SQL function alongside RLIKE.
> REGEXP is also registered as a function in Hive, so adding it would reduce the 
> migration effort for those users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34376) Support regexp as a function

2021-02-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34376:


Assignee: Apache Spark

> Support regexp as a function
> 
>
> Key: SPARK-34376
> URL: https://issues.apache.org/jira/browse/SPARK-34376
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> SqlBase.g4 already treats RLIKE and REGEXP as equivalent keywords: 'RLIKE' | 'REGEXP';
> however, we seem to have missed registering REGEXP as a SQL function alongside RLIKE.
> REGEXP is also registered as a function in Hive, so adding it would reduce the 
> migration effort for those users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34376) Support regexp as a function

2021-02-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279495#comment-17279495
 ] 

Apache Spark commented on SPARK-34376:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/31488

> Support regexp as a function
> 
>
> Key: SPARK-34376
> URL: https://issues.apache.org/jira/browse/SPARK-34376
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> SqlBase.g4 already treats RLIKE and REGEXP as equivalent keywords: 'RLIKE' | 'REGEXP';
> however, we seem to have missed registering REGEXP as a SQL function alongside RLIKE.
> REGEXP is also registered as a function in Hive, so adding it would reduce the 
> migration effort for those users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34376) Support regexp as a function

2021-02-05 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34376:
-
Summary: Support regexp as a function  (was: Support regexp as function)

> Support regexp as a function
> 
>
> Key: SPARK-34376
> URL: https://issues.apache.org/jira/browse/SPARK-34376
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> SqlBase.g4 already treats RLIKE and REGEXP as equivalent keywords: 'RLIKE' | 'REGEXP';
> however, we seem to have missed registering REGEXP as a SQL function alongside RLIKE.
> REGEXP is also registered as a function in Hive, so adding it would reduce the 
> migration effort for those users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34376) Support regexp as function

2021-02-05 Thread Kent Yao (Jira)
Kent Yao created SPARK-34376:


 Summary: Support regexp as function
 Key: SPARK-34376
 URL: https://issues.apache.org/jira/browse/SPARK-34376
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 2.4.7, 3.1.0
Reporter: Kent Yao


SqlBase.g4 already treats RLIKE and REGEXP as equivalent keywords: 'RLIKE' | 'REGEXP';
however, we seem to have missed registering REGEXP as a SQL function alongside RLIKE.

REGEXP is also registered as a function in Hive, so adding it would reduce the 
migration effort for those users.
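To make the gap concrete, a small sketch against a local SparkSession of what
already parses versus what this ticket asks for:

{code:scala}
import org.apache.spark.sql.SparkSession

object RegexpSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()

    // Both operator spellings already work, because the grammar aliases them:
    spark.sql("SELECT 'spark' RLIKE 's.*'").show()
    spark.sql("SELECT 'spark' REGEXP 's.*'").show()

    // Requested here: the Hive-style function-call form. Before the change
    // this fails to resolve; afterwards it should behave exactly like RLIKE.
    spark.sql("SELECT regexp('spark', 's.*')").show()

    spark.stop()
  }
}
{code}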




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34375) Replaces `Mockito.initMocks` with `Mockito.openMocks`

2021-02-05 Thread Yang Jie (Jira)
Yang Jie created SPARK-34375:


 Summary: Replaces `Mockito.initMocks` with `Mockito.openMocks`
 Key: SPARK-34375
 URL: https://issues.apache.org/jira/browse/SPARK-34375
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Spark Core, Tests
Affects Versions: 3.2.0
Reporter: Yang Jie


Mockito.initMocks is a deprecated API; openMocks(Object) should be used instead.
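A minimal before/after sketch of the replacement (the suite and field names are
made up for illustration):

{code:scala}
import org.mockito.{Mock, MockitoAnnotations}

class ExampleSuite {
  @Mock private var deps: java.util.Map[String, String] = _
  private var mocks: AutoCloseable = _

  def setUp(): Unit = {
    // Deprecated since Mockito 3.4.0:
    //   MockitoAnnotations.initMocks(this)
    // The replacement returns an AutoCloseable that releases the mocks:
    mocks = MockitoAnnotations.openMocks(this)
  }

  def tearDown(): Unit = mocks.close()
}
{code}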



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34376) Support regexp as function

2021-02-05 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34376:
-
Issue Type: New Feature  (was: Bug)

> Support regexp as function
> --
>
> Key: SPARK-34376
> URL: https://issues.apache.org/jira/browse/SPARK-34376
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> SqlBase.g4 already treats RLIKE and REGEXP as equivalent keywords: 'RLIKE' | 'REGEXP';
> however, we seem to have missed registering REGEXP as a SQL function alongside RLIKE.
> REGEXP is also registered as a function in Hive, so adding it would reduce the 
> migration effort for those users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34134) LDAP authentication of spark thrift server support user id mapping

2021-02-05 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279472#comment-17279472
 ] 

angerszhu commented on SPARK-34134:
---

This does not seem to be a common use case.

> LDAP authentication of spark thrift server support user id mapping
> --
>
> Key: SPARK-34134
> URL: https://issues.apache.org/jira/browse/SPARK-34134
> Project: Spark
>  Issue Type: Improvement
>  Components: Security
>Affects Versions: 3.0.1
>Reporter: Timothy Zhang
>Priority: Major
>
> I'm trying to configure LDAP authentication for the Spark Thrift Server, and 
> would like to map user ids to mail addresses.
> In my scenario, "uid" is the key of our LDAP system, and "mail" (email 
> address) is one of its attributes. We want users to enter their email address, 
> i.e. the "mail" attribute, when they log in with a Thrift client; that is, the 
> "username" input should be mapped to a query on the mail attribute, e.g.
> {code:none}
> hive.server2.authentication.ldap.customLDAPQuery="(&(objectClass=person)(mail=${uid}))"
> {code}
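For context, the surrounding Thrift Server LDAP settings would look roughly like
this (the host and baseDN are placeholders, and whether ${uid} is substituted
with the login name this way is exactly what is being asked):

{code:none}
hive.server2.authentication=LDAP
hive.server2.authentication.ldap.url=ldap://ldap.example.com
hive.server2.authentication.ldap.baseDN=ou=people,dc=example,dc=com
hive.server2.authentication.ldap.customLDAPQuery=(&(objectClass=person)(mail=${uid}))
{code}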



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34331) Speed up DS v2 metadata col resolution

2021-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34331.
-
Fix Version/s: 3.1.1
   Resolution: Fixed

Issue resolved by pull request 31440
[https://github.com/apache/spark/pull/31440]

> Speed up DS v2 metadata col resolution
> --
>
> Key: SPARK-34331
> URL: https://issues.apache.org/jira/browse/SPARK-34331
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 3.1.1
>
>
> There is a performance regression in Spark 3.1.1. Please refer to the PR 
> description since the fix is ready.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org