[jira] [Assigned] (SPARK-35985) File source V2 ignores partition filters when empty readDataSchema

2021-07-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-35985:
---

Assignee: Steven Aerts

> File source V2 ignores partition filters when empty readDataSchema
> --
>
> Key: SPARK-35985
> URL: https://issues.apache.org/jira/browse/SPARK-35985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Steven Aerts
>Assignee: Steven Aerts
>Priority: Major
>
> A V2 datasource fails to rely on partition filters when it only wants to know 
> how many entries there are and is not interested in their content.
> So when the {{readDataSchema}} of the {{FileScan}} is empty, partition 
> filters are not pushed down and all data is scanned.
> Some examples where this happens:
> {code:java}
> scala> spark.sql("SELECT count(*) FROM parq WHERE day=20210702").explain
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#136]
>  +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
>  +- *(1) Project
>  +- *(1) Filter (isnotnull(day#68) AND (day#68 = 20210702))
>  +- *(1) ColumnarToRow
>  +- BatchScan[day#68] ParquetScan DataFilters: [], Format: parquet, Location: 
> InMemoryFileIndex[file:/..., PartitionFilters: [], PushedFilers: 
> [IsNotNull(day), EqualTo(day,20210702)], ReadSchema: struct<>, PushedFilters: 
> [IsNotNull(day), EqualTo(day,20210702)]
> scala> spark.sql("SELECT input_file_name() FROM parq WHERE 
> day=20210702").explain
> == Physical Plan ==
> *(1) Project [input_file_name() AS input_file_name()#131]
> +- *(1) Filter (isnotnull(day#68) AND (day#68 = 20210702))
>  +- *(1) ColumnarToRow
>  +- BatchScan[day#68] ParquetScan DataFilters: [], Format: parquet, Location: 
> InMemoryFileIndex[file:/..., PartitionFilters: [], PushedFilers: 
> [IsNotNull(day), EqualTo(day,20210702)], ReadSchema: struct<>, PushedFilters: 
> [IsNotNull(day), EqualTo(day,20210702)]
> {code}
>  
> Once the {{readDataSchema}} is not empty, it works correctly:
> {code:java}
> scala> spark.sql("SELECT header.tenant FROM parq WHERE day=20210702").explain
> == Physical Plan ==
> *(1) Project [header#51.tenant AS tenant#199]
> +- BatchScan[header#51, day#68] ParquetScan DataFilters: [], Format: parquet, 
> Location: InMemoryFileIndex[file:/..., PartitionFilters: [isnotnull(day#68), 
> (day#68 = 20210702)], PushedFilers: [IsNotNull(day), EqualTo(day,20210702)], 
> ReadSchema: struct>, PushedFilters: 
> [IsNotNull(day), EqualTo(day,20210702)]{code}
>  
> In V1 this optimization is available:
> {code:java}
> scala> spark.sql("SELECT count(*) FROM parq WHERE day=20210702").explain
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#27]
>  +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
>  +- *(1) Project
>  +- *(1) ColumnarToRow
>  +- FileScan parquet [year#15,month#16,day#17,hour#18] Batched: true, 
> DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/..., 
> PartitionFilters: [isnotnull(day#17), (day#17 = 20210702)], PushedFilters: 
> [], ReadSchema: struct<>{code}
> The examples use {{ParquetScan}}, but the problem happens for all file-based 
> V2 datasources.
> The fix for this issue feels very straightforward. In 
> {{PruneFileSourcePartitions}}, queries with an empty {{readDataSchema}} are 
> explicitly excluded from partition-filter pushdown:
> {code:java}
> if filters.nonEmpty && scan.readDataSchema.nonEmpty =>{code}
> Removing that condition seems to fix the issue; however, this might be too 
> naive.
> I am making a PR with tests where this change can be discussed.
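
For reference, a minimal self-contained sketch of the guard quoted above (an illustration under stated assumptions, not the actual Spark source):
{code:scala}
// Hedged sketch: partition pruning in the DSv2 branch is applied only when there
// are filters AND the scan reads at least one data column. For a count(*)-style
// query readDataSchema is empty, so the rule currently skips the scan; dropping
// the second conjunct is the change discussed in this issue.
import org.apache.spark.sql.types.StructType

def shouldPrune(filtersNonEmpty: Boolean, readDataSchema: StructType): Boolean =
  filtersNonEmpty && readDataSchema.nonEmpty

// SELECT count(*) ... has an empty read schema, so pruning is skipped today:
println(shouldPrune(filtersNonEmpty = true, readDataSchema = new StructType()))  // false
{code}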



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35985) File source V2 ignores partition filters when empty readDataSchema

2021-07-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-35985.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33191
[https://github.com/apache/spark/pull/33191]

> File source V2 ignores partition filters when empty readDataSchema
> --
>
> Key: SPARK-35985
> URL: https://issues.apache.org/jira/browse/SPARK-35985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Steven Aerts
>Assignee: Steven Aerts
>Priority: Major
> Fix For: 3.2.0
>
>
> A V2 datasource fails to rely on partition filters when it only wants to know 
> how many entries there are and is not interested in their content.
> So when the {{readDataSchema}} of the {{FileScan}} is empty, partition 
> filters are not pushed down and all data is scanned.
> Some examples where this happens:
> {code:java}
> scala> spark.sql("SELECT count(*) FROM parq WHERE day=20210702").explain
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#136]
>  +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
>  +- *(1) Project
>  +- *(1) Filter (isnotnull(day#68) AND (day#68 = 20210702))
>  +- *(1) ColumnarToRow
>  +- BatchScan[day#68] ParquetScan DataFilters: [], Format: parquet, Location: 
> InMemoryFileIndex[file:/..., PartitionFilters: [], PushedFilers: 
> [IsNotNull(day), EqualTo(day,20210702)], ReadSchema: struct<>, PushedFilters: 
> [IsNotNull(day), EqualTo(day,20210702)]
> scala> spark.sql("SELECT input_file_name() FROM parq WHERE 
> day=20210702").explain
> == Physical Plan ==
> *(1) Project [input_file_name() AS input_file_name()#131]
> +- *(1) Filter (isnotnull(day#68) AND (day#68 = 20210702))
>  +- *(1) ColumnarToRow
>  +- BatchScan[day#68] ParquetScan DataFilters: [], Format: parquet, Location: 
> InMemoryFileIndex[file:/..., PartitionFilters: [], PushedFilers: 
> [IsNotNull(day), EqualTo(day,20210702)], ReadSchema: struct<>, PushedFilters: 
> [IsNotNull(day), EqualTo(day,20210702)]
> {code}
>  
> Once the {{readDataSchema}} is not empty, it works correctly:
> {code:java}
> scala> spark.sql("SELECT header.tenant FROM parq WHERE day=20210702").explain
> == Physical Plan ==
> *(1) Project [header#51.tenant AS tenant#199]
> +- BatchScan[header#51, day#68] ParquetScan DataFilters: [], Format: parquet, 
> Location: InMemoryFileIndex[file:/..., PartitionFilters: [isnotnull(day#68), 
> (day#68 = 20210702)], PushedFilers: [IsNotNull(day), EqualTo(day,20210702)], 
> ReadSchema: struct>, PushedFilters: 
> [IsNotNull(day), EqualTo(day,20210702)]{code}
>  
> In V1 this optimization is available:
> {code:java}
> scala> spark.sql("SELECT count(*) FROM parq WHERE day=20210702").explain
> == Physical Plan ==
> *(2) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#27]
>  +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
>  +- *(1) Project
>  +- *(1) ColumnarToRow
>  +- FileScan parquet [year#15,month#16,day#17,hour#18] Batched: true, 
> DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/..., 
> PartitionFilters: [isnotnull(day#17), (day#17 = 20210702)], PushedFilters: 
> [], ReadSchema: struct<>{code}
> The examples use {{ParquetScan}}, but the problem happens for all file-based 
> V2 datasources.
> The fix for this issue feels very straightforward. In 
> {{PruneFileSourcePartitions}}, queries with an empty {{readDataSchema}} are 
> explicitly excluded from partition-filter pushdown:
> {code:java}
> if filters.nonEmpty && scan.readDataSchema.nonEmpty =>{code}
> Removing that condition seems to fix the issue; however, this might be too 
> naive.
> I am making a PR with tests where this change can be discussed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36174) Support explain final plan in AQE

2021-07-15 Thread XiDuo You (Jira)
XiDuo You created SPARK-36174:
-

 Summary: Support explain final plan in AQE
 Key: SPARK-36174
 URL: https://issues.apache.org/jira/browse/SPARK-36174
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: XiDuo You


Under AQE, the executed plan can change while the query is running; however, the 
current implementation of explain does not support showing this.

As AQE is enabled by default, users may want to get the final plan of a query, so 
it makes sense to add a new grammar to support it.
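
For context, a small illustration of the behavior referred to above, assuming stock Spark 3.x in a spark-shell with AQE enabled (the query itself is made up for the example):
{code:scala}
import org.apache.spark.sql.functions.col

// With AQE, explain() on an unexecuted query shows AdaptiveSparkPlan with
// isFinalPlan=false; only after the query has actually run does the same
// DataFrame's explain() show the re-optimized plan with isFinalPlan=true.
val df = spark.range(0, 1000000).groupBy((col("id") % 10).as("k")).count()

df.explain()   // AdaptiveSparkPlan isFinalPlan=false (initial plan)
df.collect()   // run the query so AQE can re-optimize it
df.explain()   // AdaptiveSparkPlan isFinalPlan=true (the final plan this issue targets)
{code}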



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36173) [CORE] Support getting CPU number in TaskContext

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36173:


Assignee: Apache Spark

> [CORE] Support getting CPU number in TaskContext
> 
>
> Key: SPARK-36173
> URL: https://issues.apache.org/jira/browse/SPARK-36173
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Xiaochang Wu
>Assignee: Apache Spark
>Priority: Major
>
> In stage-level resource scheduling, the allocated 3rd-party resources can be 
> obtained in TaskContext via the resources() interface; however, there is no API 
> to get how many CPUs are allocated to the task. This will add a cpus() interface 
> to TaskContext in addition to resources().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36173) [CORE] Support getting CPU number in TaskContext

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36173:


Assignee: (was: Apache Spark)

> [CORE] Support getting CPU number in TaskContext
> 
>
> Key: SPARK-36173
> URL: https://issues.apache.org/jira/browse/SPARK-36173
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Xiaochang Wu
>Priority: Major
>
> In stage-level resource scheduling, the allocated 3rd-party resources can be 
> obtained in TaskContext via the resources() interface; however, there is no API 
> to get how many CPUs are allocated to the task. This will add a cpus() interface 
> to TaskContext in addition to resources().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36173) [CORE] Support getting CPU number in TaskContext

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381741#comment-17381741
 ] 

Apache Spark commented on SPARK-36173:
--

User 'xwu99' has created a pull request for this issue:
https://github.com/apache/spark/pull/33385

> [CORE] Support getting CPU number in TaskContext
> 
>
> Key: SPARK-36173
> URL: https://issues.apache.org/jira/browse/SPARK-36173
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Xiaochang Wu
>Priority: Major
>
> In stage-level resource scheduling, the allocated 3rd-party resources can be 
> obtained in TaskContext via the resources() interface; however, there is no API 
> to get how many CPUs are allocated to the task. This will add a cpus() interface 
> to TaskContext in addition to resources().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36173) [CORE] Support getting CPU number in TaskContext

2021-07-15 Thread Xiaochang Wu (Jira)
Xiaochang Wu created SPARK-36173:


 Summary: [CORE] Support getting CPU number in TaskContext
 Key: SPARK-36173
 URL: https://issues.apache.org/jira/browse/SPARK-36173
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.2
Reporter: Xiaochang Wu


In stage-level resource scheduling, the allocated 3rd-party resources can be 
obtained in TaskContext via the resources() interface; however, there is no API to 
get how many CPUs are allocated to the task. This will add a cpus() interface to 
TaskContext in addition to resources().
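
A hedged usage sketch of what such an API could look like from task code (the method name follows the issue text; the eventual signature may differ, and the RDD and SparkContext here are assumptions for illustration):
{code:scala}
import org.apache.spark.TaskContext

// Assumes an existing SparkContext `sc`.
val rdd = sc.parallelize(1 to 100, numSlices = 4)

rdd.mapPartitions { iter =>
  val ctx = TaskContext.get()
  // Existing API: 3rd-party resources allocated to this task (e.g. GPUs).
  val gpus = ctx.resources().get("gpu").map(_.addresses.length).getOrElse(0)
  // Proposed API from this issue (assumption: returns the task's CPU count).
  val cpus = ctx.cpus()
  // ... size thread pools or native libraries according to `cpus` ...
  iter
}.count()
{code}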



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36171) Upgrade GenJavadoc to 0.18

2021-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-36171.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33383
[https://github.com/apache/spark/pull/33383]

> Upgrade GenJavadoc to 0.18
> --
>
> Key: SPARK-36171
> URL: https://issues.apache.org/jira/browse/SPARK-36171
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.0
>
>
> GenJavadoc 0.18 was released, which includes a bug fix for Scala 2.13.
> https://github.com/lightbend/genjavadoc/releases/tag/v0.18



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36171) Upgrade GenJavadoc to 0.18

2021-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36171:
--
Affects Version/s: 3.2.0

> Upgrade GenJavadoc to 0.18
> --
>
> Key: SPARK-36171
> URL: https://issues.apache.org/jira/browse/SPARK-36171
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.0
>
>
> GenJavadoc 0.18 was released, which includes a bug fix for Scala 2.13.
> https://github.com/lightbend/genjavadoc/releases/tag/v0.18



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36172) Document session window in Structured Streaming guide doc

2021-07-15 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-36172:


 Summary: Document session window in Structured Streaming guide doc
 Key: SPARK-36172
 URL: https://issues.apache.org/jira/browse/SPARK-36172
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 3.2.0
Reporter: Jungtaek Lim


Given that we are shipping the new feature "native support of session window", we 
should also document it in the Structured Streaming guide doc so that end users 
can leverage it.
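
As a pointer for the eventual doc, a hedged usage sketch of the feature using the {{session_window}} function (the input DataFrame, its column names, and the gap duration are assumptions for illustration):
{code:scala}
import org.apache.spark.sql.functions.{col, count, session_window}

// `events` is assumed to be a streaming DataFrame with eventTime and userId columns.
val sessionCounts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(session_window(col("eventTime"), "5 minutes"), col("userId"))
  .agg(count("*").as("numEvents"))
{code}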



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode

2021-07-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36034:
-
Fix Version/s: 3.1.3

> Incorrect datetime filter when reading Parquet files written in legacy mode
> ---
>
> Key: SPARK-36034
> URL: https://issues.apache.org/jira/browse/SPARK-36034
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Willi Raschkowski
>Assignee: Max Gekk
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.2.0, 3.1.3, 3.3.0
>
>
> We're seeing incorrect date filters on Parquet files written by Spark 2 or by 
> Spark 3 with legacy rebase mode.
> This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2):
> {code:title=Good (Corrected Mode)}
> >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", 
> >>> "CORRECTED")
> >>> spark.sql("SELECT DATE '0001-01-01' AS 
> >>> date").write.mode("overwrite").parquet("date_written_by_spark3_corrected")
> >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", 
> >>> "date = '0001-01-01'").show()
> +--+---+
> |  date|(date = 0001-01-01)|
> +--+---+
> |0001-01-01|   true|
> +--+---+
> >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = 
> >>> '0001-01-01'").show()
> +--+
> |  date|
> +--+
> |0001-01-01|
> +--+
> {code}
> This is how we get incorrect results in _legacy_ mode; in this case, the 
> filter is dropping rows it shouldn't:
> {code:title=Bad (Legacy Mode)}
> In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", 
> "LEGACY")
> >>> spark.sql("SELECT DATE '0001-01-01' AS 
> >>> date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
> >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", 
> >>> "date = '0001-01-01'").show()
> +--+---+
> |  date|(date = 0001-01-01)|
> +--+---+
> |0001-01-01|   true|
> +--+---+
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = 
> >>> '0001-01-01'").show()
> ++
> |date|
> ++
> ++
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = 
> >>> '0001-01-01'").explain()
> == Physical Plan ==
> *(1) Filter (isnotnull(date#154) AND (date#154 = -719162))
> +- *(1) ColumnarToRow
>+- FileScan parquet [date#154] Batched: true, DataFilters: 
> [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(date), 
> EqualTo(date,0001-01-01)], ReadSchema: struct
> {code}
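
A side note (not part of the original report) on where the {{-719162}} in the pushed-down filter comes from: it is the proleptic Gregorian epoch day for {{0001-01-01}}, while a file written in legacy mode stores the value rebased to the hybrid Julian calendar, which appears to be why the pushed-down comparison misses the row:
{code:scala}
import java.time.LocalDate

// -719162 is the (proleptic Gregorian) epoch day of 0001-01-01, i.e. the literal
// that the pushed-down EqualTo(date, 0001-01-01) is compared against:
LocalDate.of(1, 1, 1).toEpochDay   // -719162
{code}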



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36167) Revisit more InternalField managements.

2021-07-15 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-36167.
---
Fix Version/s: 3.2.0
 Assignee: Takuya Ueshin
   Resolution: Fixed

Issue resolved by pull request 33377
https://github.com/apache/spark/pull/33377

> Revisit more InternalField managements.
> ---
>
> Key: SPARK-36167
> URL: https://issues.apache.org/jira/browse/SPARK-36167
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.2.0
>
>
> There are other places we can manage {{InternalField}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36167) Revisit more InternalField managements.

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381689#comment-17381689
 ] 

Apache Spark commented on SPARK-36167:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/33384

> Revisit more InternalField managements.
> ---
>
> Key: SPARK-36167
> URL: https://issues.apache.org/jira/browse/SPARK-36167
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.2.0
>
>
> There are other places we can manage {{InternalField}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36169) Make 'spark.sql.sources.disabledJdbcConnProviderList' a static conf (as documented)

2021-07-15 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381680#comment-17381680
 ] 

Hyukjin Kwon commented on SPARK-36169:
--

fixed in https://github.com/apache/spark/pull/33381

> Make 'spark.sql.sources.disabledJdbcConnProviderList' a static conf (as 
> documented)
> --
>
> Key: SPARK-36169
> URL: https://issues.apache.org/jira/browse/SPARK-36169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.3.0
>
>
> {{spark.sql.sources.disabledJdbcConnProviderList}} is supposed to be a static 
> config (it doesn't take effect at runtime anyway), but it is currently placed 
> as a runtime config.
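
For readers unfamiliar with the distinction, a hedged illustration of why the placement matters (the config name is from the issue; the behavior shown is the general static-conf contract in Spark SQL, sketched for a local session):
{code:scala}
import org.apache.spark.sql.SparkSession

// Static confs must be supplied when the session is built:
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("static-conf-example")
  .config("spark.sql.sources.disabledJdbcConnProviderList", "mysql")
  .getOrCreate()

// As a runtime conf (the current placement) this call succeeds but has no effect;
// once the conf is made static, as proposed here, it would instead fail with
// "Cannot modify the value of a static config".
spark.conf.set("spark.sql.sources.disabledJdbcConnProviderList", "mysql")
{code}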



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36169) Make 'spark.sql.sources.disabledJdbcConnProviderList' a static conf (as documented)

2021-07-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36169.
--
Fix Version/s: 3.3.0
 Assignee: Hyukjin Kwon
   Resolution: Fixed

> Make 'spark.sql.sources.disabledJdbcConnProviderList' a static conf (as 
> documented)
> --
>
> Key: SPARK-36169
> URL: https://issues.apache.org/jira/browse/SPARK-36169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.3.0
>
>
> {{spark.sql.sources.disabledJdbcConnProviderList}} is supposed to be a static 
> config (it doesn't take effect at runtime anyway), but it is currently placed 
> as a runtime config.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36146) Upgrade Python version from 3.6 to higher version in GitHub linter

2021-07-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36146:
-
Fix Version/s: 3.2.0

> Upgrade Python version from 3.6 to higher version in GitHub linter
> --
>
> Key: SPARK-36146
> URL: https://issues.apache.org/jira/browse/SPARK-36146
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.2.0, 3.3.0
>
>
> MyPy checks fail with higher Python versions. For example, with Python 3.8:
> {code}
> python/pyspark/sql/pandas/_typing/protocols/frame.pyi:64: error: Name 
> "np.ndarray" is not defined
> python/pyspark/sql/pandas/_typing/protocols/frame.pyi:91: error: Name 
> "np.recarray" is not defined
> python/pyspark/sql/pandas/_typing/protocols/frame.pyi:165: error: Name 
> "np.ndarray" is not defined
> python/pyspark/pandas/categorical.py:82: error: Item "dtype[Any]" of 
> "Union[dtype[Any], Any]" has no attribute "categories"
> python/pyspark/pandas/categorical.py:109: error: Item "dtype[Any]" of 
> "Union[dtype[Any], Any]" has no attribute "ordered"
> python/pyspark/ml/linalg/__init__.pyi:184: error: Return type "ndarray[Any, 
> Any]" of "toArray" incompatible with return type "NoReturn" in supertype 
> "Matrix"
> python/pyspark/ml/linalg/__init__.pyi:217: error: Return type "ndarray[Any, 
> Any]" of "toArray" incompatible with return type "NoReturn" in supertype 
> "Matrix"
> python/pyspark/pandas/typedef/typehints.py:163: error: Module has no 
> attribute "bool"; maybe "bool_" or "bool8"?
> python/pyspark/pandas/typedef/typehints.py:174: error: Module has no 
> attribute "float"; maybe "float_", "cfloat", or "float96"?
> python/pyspark/pandas/typedef/typehints.py:180: error: Module has no 
> attribute "int"; maybe "uint", "rint", or "intp"?
> python/pyspark/pandas/ml.py:81: error: Value of type variable 
> "_DTypeScalar_co" of "dtype" cannot be "object"
> python/pyspark/pandas/indexing.py:1649: error: Module has no attribute "int"; 
> maybe "uint", "rint", or "intp"?
> python/pyspark/pandas/indexing.py:1656: error: Module has no attribute "int"; 
> maybe "uint", "rint", or "intp"?
> python/pyspark/pandas/frame.py:4969: error: Function "numpy.array" is not 
> valid as a type
> python/pyspark/pandas/frame.py:4969: note: Perhaps you need "Callable[...]" 
> or a callback protocol?
> python/pyspark/pandas/frame.py:4970: error: Function "numpy.array" is not 
> valid as a type
> python/pyspark/pandas/frame.py:4970: note: Perhaps you need "Callable[...]" 
> or a callback protocol?
> python/pyspark/pandas/frame.py:7402: error: "List[Any]" has no attribute 
> "tolist"
> python/pyspark/pandas/series.py:1030: error: Module has no attribute 
> "_NoValue"
> python/pyspark/pandas/series.py:1031: error: Module has no attribute 
> "_NoValue"
> python/pyspark/pandas/indexes/category.py:159: error: Item "dtype[Any]" of 
> "Union[dtype[Any], Any]" has no attribute "categories"
> python/pyspark/pandas/indexes/category.py:180: error: Item "dtype[Any]" of 
> "Union[dtype[Any], Any]" has no attribute "ordered"
> python/pyspark/pandas/namespace.py:2036: error: Argument 1 to "column_name" 
> has incompatible type "float"; expected "str"
> python/pyspark/pandas/mlflow.py:59: error: Incompatible types in assignment 
> (expression has type "Type[floating[Any]]", variable has type "str")
> python/pyspark/pandas/data_type_ops/categorical_ops.py:43: error: Item 
> "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
> python/pyspark/pandas/data_type_ops/categorical_ops.py:43: error: Item 
> "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
> python/pyspark/pandas/data_type_ops/categorical_ops.py:56: error: Item 
> "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
> python/pyspark/pandas/tests/test_typedef.py:70: error: Name "np.float" is not 
> defined
> python/pyspark/pandas/tests/test_typedef.py:77: error: Name "np.float" is not 
> defined
> python/pyspark/pandas/tests/test_typedef.py:85: error: Name "np.float" is not 
> defined
> python/pyspark/pandas/tests/test_typedef.py:100: error: Name "np.float" is 
> not defined
> python/pyspark/pandas/tests/test_typedef.py:108: error: Name "np.float" is 
> not defined
> python/pyspark/mllib/clustering.pyi:152: error: Incompatible types in 
> assignment (expression has type "ndarray[Any, Any]", base class "KMeansModel" 
> defined the type as "List[ndarray[Any, Any]]")
> python/pyspark/mllib/classification.pyi:93: error: Signature of "predict" 
> incompatible with supertype "LinearClassificationModel"
> Found 32 errors in 15 files (checked 315 source files)
> 1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Assigned] (SPARK-36171) Upgrade GenJavadoc to 0.18

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36171:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Upgrade GenJavadoc to 0.18
> --
>
> Key: SPARK-36171
> URL: https://issues.apache.org/jira/browse/SPARK-36171
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> GenJavadoc 0.18 was released, which includes a bug fix for Scala 2.13.
> https://github.com/lightbend/genjavadoc/releases/tag/v0.18



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36171) Upgrade GenJavadoc to 0.18

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36171:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Upgrade GenJavadoc to 0.18
> --
>
> Key: SPARK-36171
> URL: https://issues.apache.org/jira/browse/SPARK-36171
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> GenJavadoc 0.18 was released, which includes a bug fix for Scala 2.13.
> https://github.com/lightbend/genjavadoc/releases/tag/v0.18



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36137) HiveShim always fallback to getAllPartitionsOf regardless of whether directSQL is enabled in remote HMS

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36137:


Assignee: Apache Spark

> HiveShim always fallback to getAllPartitionsOf regardless of whether 
> directSQL is enabled in remote HMS
> ---
>
> Key: SPARK-36137
> URL: https://issues.apache.org/jira/browse/SPARK-36137
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
>
> At the moment, {{getPartitionsByFilter}} in the Hive shim only falls back to 
> {{getAllPartitionsOf}} when {{hive.metastore.try.direct.sql}} is enabled in 
> the remote HMS. However, in certain cases the remote HMS will fall back to 
> ORM (which only supports string types for partition columns) to query the 
> underlying RDBMS even if this config is set to true, and Spark will not be 
> able to recover from the error and will just fail the query.
> For instance, we encountered the bug HIVE-21497 in an HMS running Hive 3.1.2, 
> and Spark was not able to push down a filter on a {{date}} column.
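
A hedged sketch of the control flow being discussed (illustrative code under stated assumptions, not the actual HiveShim source; the two function parameters stand in for the metastore calls named above):
{code:scala}
import scala.util.control.NonFatal

// Per the description, the fallback to a full listing is currently gated on the
// remote HMS's hive.metastore.try.direct.sql setting; the issue argues that gate
// is unreliable, because the filtered call can still fail (e.g. HIVE-21497) and
// Spark then fails the whole query instead of falling back.
def partitionsWithFallback[P](
    listByFilter: () => Seq[P],   // stands in for getPartitionsByFilter
    listAll: () => Seq[P],        // stands in for getAllPartitionsOf
    allowFallback: Boolean): Seq[P] =
  try listByFilter()
  catch {
    case NonFatal(_) if allowFallback => listAll()
  }
{code}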



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36171) Upgrade GenJavadoc to 0.18

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381671#comment-17381671
 ] 

Apache Spark commented on SPARK-36171:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33383

> Upgrade GenJavadoc to 0.18
> --
>
> Key: SPARK-36171
> URL: https://issues.apache.org/jira/browse/SPARK-36171
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> GenJavadoc 0.18 was released, which includes a bug fix for Scala 2.13.
> https://github.com/lightbend/genjavadoc/releases/tag/v0.18



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36137) HiveShim always fallback to getAllPartitionsOf regardless of whether directSQL is enabled in remote HMS

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381669#comment-17381669
 ] 

Apache Spark commented on SPARK-36137:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/33382

> HiveShim always fallback to getAllPartitionsOf regardless of whether 
> directSQL is enabled in remote HMS
> ---
>
> Key: SPARK-36137
> URL: https://issues.apache.org/jira/browse/SPARK-36137
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> At the moment, {{getPartitionsByFilter}} in the Hive shim only falls back to 
> {{getAllPartitionsOf}} when {{hive.metastore.try.direct.sql}} is enabled in 
> the remote HMS. However, in certain cases the remote HMS will fall back to 
> ORM (which only supports string types for partition columns) to query the 
> underlying RDBMS even if this config is set to true, and Spark will not be 
> able to recover from the error and will just fail the query.
> For instance, we encountered the bug HIVE-21497 in an HMS running Hive 3.1.2, 
> and Spark was not able to push down a filter on a {{date}} column.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36137) HiveShim always fallback to getAllPartitionsOf regardless of whether directSQL is enabled in remote HMS

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36137:


Assignee: (was: Apache Spark)

> HiveShim always fallback to getAllPartitionsOf regardless of whether 
> directSQL is enabled in remote HMS
> ---
>
> Key: SPARK-36137
> URL: https://issues.apache.org/jira/browse/SPARK-36137
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> At the moment, {{getPartitionsByFilter}} in the Hive shim only falls back to 
> {{getAllPartitionsOf}} when {{hive.metastore.try.direct.sql}} is enabled in 
> the remote HMS. However, in certain cases the remote HMS will fall back to 
> ORM (which only supports string types for partition columns) to query the 
> underlying RDBMS even if this config is set to true, and Spark will not be 
> able to recover from the error and will just fail the query.
> For instance, we encountered the bug HIVE-21497 in an HMS running Hive 3.1.2, 
> and Spark was not able to push down a filter on a {{date}} column.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36137) HiveShim always fallback to getAllPartitionsOf regardless of whether directSQL is enabled in remote HMS

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381670#comment-17381670
 ] 

Apache Spark commented on SPARK-36137:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/33382

> HiveShim always fallback to getAllPartitionsOf regardless of whether 
> directSQL is enabled in remote HMS
> ---
>
> Key: SPARK-36137
> URL: https://issues.apache.org/jira/browse/SPARK-36137
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> At the moment, {{getPartitionsByFilter}} in the Hive shim only falls back to 
> {{getAllPartitionsOf}} when {{hive.metastore.try.direct.sql}} is enabled in 
> the remote HMS. However, in certain cases the remote HMS will fall back to 
> ORM (which only supports string types for partition columns) to query the 
> underlying RDBMS even if this config is set to true, and Spark will not be 
> able to recover from the error and will just fail the query.
> For instance, we encountered the bug HIVE-21497 in an HMS running Hive 3.1.2, 
> and Spark was not able to push down a filter on a {{date}} column.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36169) Make 'spark.sql.sources.disabledJdbcConnProviderList' a static conf (as documented)

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381667#comment-17381667
 ] 

Apache Spark commented on SPARK-36169:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/33381

> Make 'spark.sql.sources.disabledJdbcConnProviderList' a static conf (as 
> documented)
> --
>
> Key: SPARK-36169
> URL: https://issues.apache.org/jira/browse/SPARK-36169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> {{spark.sql.sources.disabledJdbcConnProviderList}} is supposed to be a static 
> config (it doesn't take effect at runtime anyway), but it is currently placed 
> as a runtime config.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36169) Make 'spark.sql.sources.disabledJdbcConnProviderList' a static conf (as documented)

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381664#comment-17381664
 ] 

Apache Spark commented on SPARK-36169:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/33381

> Make 'spark.sql.sources.disabledJdbcConnProviderList' a static conf (as 
> documented)
> --
>
> Key: SPARK-36169
> URL: https://issues.apache.org/jira/browse/SPARK-36169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> {{spark.sql.sources.disabledJdbcConnProviderList}} is supposed to be a static 
> config (it doesn't take effect at runtime anyway), but it is currently placed 
> as a runtime config.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36169) Make 'spark.sql.sources.disabledJdbcConnProviderList' a static conf (as documented)

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36169:


Assignee: (was: Apache Spark)

> Make 'spark.sql.sources.disabledJdbcConnProviderList' a static conf (as 
> documented)
> --
>
> Key: SPARK-36169
> URL: https://issues.apache.org/jira/browse/SPARK-36169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> {{spark.sql.sources.disabledJdbcConnProviderList}} is supposed to be a static 
> config (it doesn't take effect at runtime anyway), but it is currently placed 
> as a runtime config.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36169) Make 'spark.sql.sources.disabledJdbcConnProviderList' a static conf (as documented)

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36169:


Assignee: Apache Spark

> Make 'spark.sql.sources.disabledJdbcConnProviderList' a static conf (as 
> documented)
> --
>
> Key: SPARK-36169
> URL: https://issues.apache.org/jira/browse/SPARK-36169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> {{spark.sql.sources.disabledJdbcConnProviderList}} is supposed to be a static 
> config (it doesn't take effect at runtime anyway), but it is currently placed 
> as a runtime config.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36166) Support Scala 2.13 test in `dev/run-tests.py`

2021-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-36166.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33376
[https://github.com/apache/spark/pull/33376]

> Support Scala 2.13 test in `dev/run-tests.py`
> -
>
> Key: SPARK-36166
> URL: https://issues.apache.org/jira/browse/SPARK-36166
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36171) Upgrade GenJavadoc to 0.18

2021-07-15 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-36171:
--

 Summary: Upgrade GenJavadoc to 0.18
 Key: SPARK-36171
 URL: https://issues.apache.org/jira/browse/SPARK-36171
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 3.3.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


GenJavadoc 0.18 was released, which includes a bug fix for Scala 2.13.
https://github.com/lightbend/genjavadoc/releases/tag/v0.18



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35810) Deprecate ps.broadcast API

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35810:


Assignee: Apache Spark

> Deprecate ps.broadcast API
> --
>
> Key: SPARK-35810
> URL: https://issues.apache.org/jira/browse/SPARK-35810
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> We have 
> [ps.broadcast|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.broadcast.html]
>  in pandas API on Spark, but it duplicates 
> [DataFrame.spark.hint|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.spark.hint.html]
>  when that API is used with "broadcast".
> So we'd better deprecate it, and the 
> [broadcast|http://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.broadcast.html]
>  function in PySpark as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35810) Deprecate ps.broadcast API

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381657#comment-17381657
 ] 

Apache Spark commented on SPARK-35810:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/33379

> Deprecate ps.broadcast API
> --
>
> Key: SPARK-35810
> URL: https://issues.apache.org/jira/browse/SPARK-35810
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We have 
> [ps.broadcast|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.broadcast.html]
>  in pandas API on Spark, but it duplicates 
> [DataFrame.spark.hint|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.spark.hint.html]
>  when that API is used with "broadcast".
> So we'd better deprecate it, and the 
> [broadcast|http://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.broadcast.html]
>  function in PySpark as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35810) Deprecate ps.broadcast API

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35810:


Assignee: (was: Apache Spark)

> Deprecate ps.broadcast API
> --
>
> Key: SPARK-35810
> URL: https://issues.apache.org/jira/browse/SPARK-35810
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We have 
> [ps.broadcast|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.broadcast.html]
>  in pandas API on Spark, but it duplicates 
> [DataFrame.spark.hint|https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.spark.hint.html]
>  when that API is used with "broadcast".
> So we'd better deprecate it, and the 
> [broadcast|http://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.broadcast.html]
>  function in PySpark as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36170) Change quoted interval literal (interval constructor) to be converted to ANSI interval types

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381659#comment-17381659
 ] 

Apache Spark commented on SPARK-36170:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33380

> Change quoted interval literal (interval constructor) to be converted to ANSI 
> interval types
> 
>
> Key: SPARK-36170
> URL: https://issues.apache.org/jira/browse/SPARK-36170
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> The unit-to-unit interval literals and the unit list interval literals are 
> converted to ANSI interval types, but quoted interval literals are still 
> converted to CalendarIntervalType.
> {code}
> -- Unit list interval literals
> spark-sql> select interval 1 year 2 month;
> 1-2
> -- Quoted interval literals
> spark-sql> select interval '1 year 2 month';
> 1 years 2 months
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36170) Change quoted interval literal (interval constructor) to be converted to ANSI interval types

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36170:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Change quoted interval literal (interval constructor) to be converted to ANSI 
> interval types
> 
>
> Key: SPARK-36170
> URL: https://issues.apache.org/jira/browse/SPARK-36170
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> The unit-to-unit interval literals and the unit list interval literals are 
> converted to ANSI interval types, but quoted interval literals are still 
> converted to CalendarIntervalType.
> {code}
> -- Unit list interval literals
> spark-sql> select interval 1 year 2 month;
> 1-2
> -- Quoted interval literals
> spark-sql> select interval '1 year 2 month';
> 1 years 2 months
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36170) Change quoted interval literal (interval constructor) to be converted to ANSI interval types

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381658#comment-17381658
 ] 

Apache Spark commented on SPARK-36170:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33380

> Change quoted interval literal (interval constructor) to be converted to ANSI 
> interval types
> 
>
> Key: SPARK-36170
> URL: https://issues.apache.org/jira/browse/SPARK-36170
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> The unit-to-unit interval literals and the unit list interval literals are 
> converted to ANSI interval types, but quoted interval literals are still 
> converted to CalendarIntervalType.
> {code}
> -- Unit list interval literals
> spark-sql> select interval 1 year 2 month;
> 1-2
> -- Quoted interval literals
> spark-sql> select interval '1 year 2 month';
> 1 years 2 months
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36170) Change quoted interval literal (interval constructor) to be converted to ANSI interval types

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36170:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Change quoted interval literal (interval constructor) to be converted to ANSI 
> interval types
> 
>
> Key: SPARK-36170
> URL: https://issues.apache.org/jira/browse/SPARK-36170
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> The unit-to-unit interval literals and the unit list interval literals are 
> converted to ANSI interval types, but quoted interval literals are still 
> converted to CalendarIntervalType.
> {code}
> -- Unit list interval literals
> spark-sql> select interval 1 year 2 month;
> 1-2
> -- Quoted interval literals
> spark-sql> select interval '1 year 2 month';
> 1 years 2 months
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36170) Change quoted interval literal (interval constructor) to be converted to ANSI interval types

2021-07-15 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-36170:
--

 Summary: Change quoted interval literal (interval constructor) to 
be converted to ANSI interval types
 Key: SPARK-36170
 URL: https://issues.apache.org/jira/browse/SPARK-36170
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0, 3.3.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


The unit-to-unit interval literals and the unit list interval literals are 
converted to ANSI interval types, but quoted interval literals are still 
converted to CalendarIntervalType.

{code}
-- Unit list interval literals
spark-sql> select interval 1 year 2 month;
1-2
-- Quoted interval literals
spark-sql> select interval '1 year 2 month';
1 years 2 months
{code}
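
A quick way to see the type difference in a spark-shell (a hedged illustration; the exact rendering of the ANSI type may vary by build):
{code:scala}
// Unit list literal: already mapped to an ANSI year-month interval type.
spark.sql("select interval 1 year 2 month").schema.head.dataType
// e.g. YearMonthIntervalType(YEAR, MONTH)

// Quoted literal: still the legacy CalendarIntervalType, which is what this
// issue proposes to change.
spark.sql("select interval '1 year 2 month'").schema.head.dataType
// CalendarIntervalType
{code}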



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36134) jackson-databind RCE vulnerability

2021-07-15 Thread Sumit (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381642#comment-17381642
 ] 

Sumit commented on SPARK-36134:
---

[~xkrogen] Thank you.

[https://spark.apache.org/docs/latest/index.html] (3.1.2) released.

> jackson-databind RCE vulnerability
> --
>
> Key: SPARK-36134
> URL: https://issues.apache.org/jira/browse/SPARK-36134
> Project: Spark
>  Issue Type: Task
>  Components: Java API
>Affects Versions: 3.1.2, 3.1.3
>Reporter: Sumit
>Priority: Major
> Attachments: Screenshot 2021-07-15 at 1.00.55 PM.png
>
>
> Need to upgrade   jackson-databind version to *2.9.3.1*
> At the beginning of 2018, jackson-databind was reported to contain another 
> remote code execution (RCE) vulnerability (CVE-2017-17485) that affects 
> versions 2.9.3 and earlier, 2.7.9.1 and earlier, and 2.8.10 and earlier. This 
> vulnerability is caused by jackson-databind's incomplete blacklist. An 
> application that uses jackson-databind will become vulnerable when the 
> enableDefaultTyping method is called via the ObjectMapper object within the 
> application. An attacker can thus compromise the application by sending 
> maliciously crafted JSON input to gain direct control over a server. 
> Currently, a proof of concept (POC) exploit for this vulnerability has been 
> publicly available. All users who are affected by this vulnerability should 
> upgrade to the latest versions as soon as possible to fix this issue.
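
For applications that embed Spark and want to pin a patched artifact themselves, a hedged sbt example (the version string is taken from the issue text above; check Maven Central for the latest patched 2.9.x release before relying on it):
{code:scala}
// build.sbt: force the chosen jackson-databind version across the dependency tree.
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.9.3.1"
{code}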



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36169) Make 'spark.sql.sources.disabledJdbcConnProviderList' a static conf (as documented)

2021-07-15 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-36169:


 Summary: Make 'spark.sql.sources.disabledJdbcConnProviderList' a static 
conf (as documented)
 Key: SPARK-36169
 URL: https://issues.apache.org/jira/browse/SPARK-36169
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


{{spark.sql.sources.disabledJdbcConnProviderList}} is supposed to be a static 
config (it doesn't take effect at runtime anyway), but it is currently placed as a 
runtime config.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32922) Add support for ShuffleBlockFetcherIterator to read from merged shuffle partitions and to fallback to original shuffle blocks if encountering failures

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381620#comment-17381620
 ] 

Apache Spark commented on SPARK-32922:
--

User 'otterc' has created a pull request for this issue:
https://github.com/apache/spark/pull/33378

> Add support for ShuffleBlockFetcherIterator to read from merged shuffle 
> partitions and to fallback to original shuffle blocks if encountering failures
> --
>
> Key: SPARK-32922
> URL: https://issues.apache.org/jira/browse/SPARK-32922
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Chandni Singh
>Priority: Major
> Fix For: 3.2.0
>
>
> With the extended MapOutputTracker, the reducers can now get the task input 
> data from the merged shuffle partitions for more efficient shuffle data 
> fetch. The reducers should also be able to fall back to fetching the original 
> unmerged blocks if they encounter failures when fetching the merged shuffle 
> partitions.
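
The control flow described above is roughly the following; this is only a schematic Python sketch with hypothetical helper names, not the real ShuffleBlockFetcherIterator code:

{code:python}
# Schematic sketch of the intended fallback behaviour. Every name below is a
# hypothetical illustration, not Spark's actual ShuffleBlockFetcherIterator API.
class FetchFailure(Exception):
    pass

def fetch_merged_partition(merged_id):
    # Stand-in for fetching one merged shuffle partition; fail for one of them
    # so the fallback path is exercised.
    if merged_id == "merged-1":
        raise FetchFailure(merged_id)
    return [f"{merged_id}/row-{i}" for i in range(3)]

def fetch_original_block(block_id):
    # Stand-in for fetching one original (unmerged) shuffle block.
    return [f"{block_id}/row-0"]

def fetch_reduce_input(merged_partitions, original_blocks_for):
    rows = []
    for merged_id in merged_partitions:
        try:
            # Preferred path: read the single merged shuffle partition.
            rows.extend(fetch_merged_partition(merged_id))
        except FetchFailure:
            # Fallback path: re-fetch the original shuffle blocks that were
            # combined into this merged partition.
            for block_id in original_blocks_for(merged_id):
                rows.extend(fetch_original_block(block_id))
    return rows

print(fetch_reduce_input(["merged-0", "merged-1"],
                         lambda m: [f"{m}-orig-{i}" for i in range(2)]))
{code}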



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32922) Add support for ShuffleBlockFetcherIterator to read from merged shuffle partitions and to fallback to original shuffle blocks if encountering failures

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381619#comment-17381619
 ] 

Apache Spark commented on SPARK-32922:
--

User 'otterc' has created a pull request for this issue:
https://github.com/apache/spark/pull/33378

> Add support for ShuffleBlockFetcherIterator to read from merged shuffle 
> partitions and to fallback to original shuffle blocks if encountering failures
> --
>
> Key: SPARK-32922
> URL: https://issues.apache.org/jira/browse/SPARK-32922
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Assignee: Chandni Singh
>Priority: Major
> Fix For: 3.2.0
>
>
> With the extended MapOutputTracker, the reducers can now get the task input 
> data from the merged shuffle partitions for more efficient shuffle data 
> fetch. The reducers should also be able to fall back to fetching the original 
> unmerged blocks if they encounter failures when fetching the merged shuffle 
> partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36167) Revisit more InternalField managements.

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36167:


Assignee: Apache Spark

> Revisit more InternalField managements.
> ---
>
> Key: SPARK-36167
> URL: https://issues.apache.org/jira/browse/SPARK-36167
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> There are other places we can manage {{InternalField}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36167) Revisit more InternalField managements.

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381618#comment-17381618
 ] 

Apache Spark commented on SPARK-36167:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/33377

> Revisit more InternalField managements.
> ---
>
> Key: SPARK-36167
> URL: https://issues.apache.org/jira/browse/SPARK-36167
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> There are other places we can manage {{InternalField}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36167) Revisit more InternalField managements.

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36167:


Assignee: (was: Apache Spark)

> Revisit more InternalField managements.
> ---
>
> Key: SPARK-36167
> URL: https://issues.apache.org/jira/browse/SPARK-36167
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> There are other places we can manage {{InternalField}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36167) Revisit more InternalField managements.

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381617#comment-17381617
 ] 

Apache Spark commented on SPARK-36167:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/33377

> Revisit more InternalField managements.
> ---
>
> Key: SPARK-36167
> URL: https://issues.apache.org/jira/browse/SPARK-36167
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> There are other places we can manage {{InternalField}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36152) Add Scala 2.13 daily build and test GitHub Action job

2021-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-36152:
-

Assignee: Dongjoon Hyun

> Add Scala 2.13 daily build and test GitHub Action job
> -
>
> Key: SPARK-36152
> URL: https://issues.apache.org/jira/browse/SPARK-36152
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36152) Add Scala 2.13 daily build and test GitHub Action job

2021-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36152:
--
Parent: SPARK-25075
Issue Type: Sub-task  (was: Test)

> Add Scala 2.13 daily build and test GitHub Action job
> -
>
> Key: SPARK-36152
> URL: https://issues.apache.org/jira/browse/SPARK-36152
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35850) Upgrade scala-maven-plugin to 4.5.3

2021-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35850:
--
Parent: SPARK-25075
Issue Type: Sub-task  (was: Improvement)

> Upgrade scala-maven-plugin to 4.5.3
> ---
>
> Key: SPARK-35850
> URL: https://issues.apache.org/jira/browse/SPARK-35850
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36110) Upgrade SBT to 1.5.5

2021-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-36110:
-

Assignee: Kousuke Saruta  (was: Apache Spark)

> Upgrade SBT to 1.5.5
> 
>
> Key: SPARK-36110
> URL: https://issues.apache.org/jira/browse/SPARK-36110
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.3.0
>
>
> SBT 1.5.5 was released, which includes 16 improvements and bug fixes.
> https://github.com/sbt/sbt/releases/tag/v1.5.5



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36110) Upgrade SBT to 1.5.5

2021-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36110:
--
Parent: SPARK-25075
Issue Type: Sub-task  (was: Improvement)

> Upgrade SBT to 1.5.5
> 
>
> Key: SPARK-36110
> URL: https://issues.apache.org/jira/browse/SPARK-36110
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.3.0
>
>
> SBT 1.5.5 was released, which includes 16 improvements and bug fixes.
> https://github.com/sbt/sbt/releases/tag/v1.5.5



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36150) Disable MiMa for Scala 2.13 artifacts

2021-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36150:
--
Parent: SPARK-25075
Issue Type: Sub-task  (was: Test)

> Disable MiMa for Scala 2.13 artifacts
> -
>
> Key: SPARK-36150
> URL: https://issues.apache.org/jira/browse/SPARK-36150
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36166) Support Scala 2.13 test in `dev/run-tests.py`

2021-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-36166:
-

Assignee: Dongjoon Hyun

> Support Scala 2.13 test in `dev/run-tests.py`
> -
>
> Key: SPARK-36166
> URL: https://issues.apache.org/jira/browse/SPARK-36166
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36166) Support Scala 2.13 test in `dev/run-tests.py`

2021-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36166:
--
Parent: SPARK-25075
Issue Type: Sub-task  (was: Test)

> Support Scala 2.13 test in `dev/run-tests.py`
> -
>
> Key: SPARK-36166
> URL: https://issues.apache.org/jira/browse/SPARK-36166
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36168) Support Scala 2.13 in `dev/test-dependencies.sh`

2021-07-15 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-36168:
-

 Summary: Support Scala 2.13 in `dev/test-dependencies.sh`
 Key: SPARK-36168
 URL: https://issues.apache.org/jira/browse/SPARK-36168
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 3.3.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36167) Revisit more InternalField managements.

2021-07-15 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-36167:
-

 Summary: Revisit more InternalField managements.
 Key: SPARK-36167
 URL: https://issues.apache.org/jira/browse/SPARK-36167
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Takuya Ueshin


There are other places we can manage {{InternalField}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36166) Support Scala 2.13 test in `dev/run-tests.py`

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381613#comment-17381613
 ] 

Apache Spark commented on SPARK-36166:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/33376

> Support Scala 2.13 test in `dev/run-tests.py`
> -
>
> Key: SPARK-36166
> URL: https://issues.apache.org/jira/browse/SPARK-36166
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36166) Support Scala 2.13 test in `dev/run-tests.py`

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36166:


Assignee: Apache Spark

> Support Scala 2.13 test in `dev/run-tests.py`
> -
>
> Key: SPARK-36166
> URL: https://issues.apache.org/jira/browse/SPARK-36166
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36166) Support Scala 2.13 test in `dev/run-tests.py`

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36166:


Assignee: (was: Apache Spark)

> Support Scala 2.13 test in `dev/run-tests.py`
> -
>
> Key: SPARK-36166
> URL: https://issues.apache.org/jira/browse/SPARK-36166
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36166) Support Scala 2.13 test in `dev/run-tests.py`

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381612#comment-17381612
 ] 

Apache Spark commented on SPARK-36166:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/33376

> Support Scala 2.13 test in `dev/run-tests.py`
> -
>
> Key: SPARK-36166
> URL: https://issues.apache.org/jira/browse/SPARK-36166
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36166) Support Scala 2.13 test in `dev/run-tests.py`

2021-07-15 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-36166:
-

 Summary: Support Scala 2.13 test in `dev/run-tests.py`
 Key: SPARK-36166
 URL: https://issues.apache.org/jira/browse/SPARK-36166
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 3.2.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36074) add error class for StructType.findNestedField

2021-07-15 Thread Karen Feng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karen Feng updated SPARK-36074:
---
Parent: SPARK-36094
Issue Type: Sub-task  (was: Improvement)

> add error class for StructType.findNestedField
> --
>
> Key: SPARK-36074
> URL: https://issues.apache.org/jira/browse/SPARK-36074
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381582#comment-17381582
 ] 

Apache Spark commented on SPARK-36034:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/33375

> Incorrect datetime filter when reading Parquet files written in legacy mode
> ---
>
> Key: SPARK-36034
> URL: https://issues.apache.org/jira/browse/SPARK-36034
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Willi Raschkowski
>Assignee: Max Gekk
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.2.0, 3.3.0
>
>
> We're seeing incorrect date filters on Parquet files written by Spark 2 or by 
> Spark 3 with legacy rebase mode.
> This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2):
> {code:title=Good (Corrected Mode)}
> >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", 
> >>> "CORRECTED")
> >>> spark.sql("SELECT DATE '0001-01-01' AS 
> >>> date").write.mode("overwrite").parquet("date_written_by_spark3_corrected")
> >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", 
> >>> "date = '0001-01-01'").show()
> +--+---+
> |  date|(date = 0001-01-01)|
> +--+---+
> |0001-01-01|   true|
> +--+---+
> >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = 
> >>> '0001-01-01'").show()
> +--+
> |  date|
> +--+
> |0001-01-01|
> +--+
> {code}
> This is how we get incorrect results in _legacy_ mode; in this case the 
> filter drops rows it shouldn't:
> {code:title=Bad (Legacy Mode)}
> In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", 
> "LEGACY")
> >>> spark.sql("SELECT DATE '0001-01-01' AS 
> >>> date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
> >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", 
> >>> "date = '0001-01-01'").show()
> +--+---+
> |  date|(date = 0001-01-01)|
> +--+---+
> |0001-01-01|   true|
> +--+---+
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = 
> >>> '0001-01-01'").show()
> ++
> |date|
> ++
> ++
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = 
> >>> '0001-01-01'").explain()
> == Physical Plan ==
> *(1) Filter (isnotnull(date#154) AND (date#154 = -719162))
> +- *(1) ColumnarToRow
>+- FileScan parquet [date#154] Batched: true, DataFilters: 
> [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(date), 
> EqualTo(date,0001-01-01)], ReadSchema: struct
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381579#comment-17381579
 ] 

Apache Spark commented on SPARK-36034:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/33375

> Incorrect datetime filter when reading Parquet files written in legacy mode
> ---
>
> Key: SPARK-36034
> URL: https://issues.apache.org/jira/browse/SPARK-36034
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Willi Raschkowski
>Assignee: Max Gekk
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.2.0, 3.3.0
>
>
> We're seeing incorrect date filters on Parquet files written by Spark 2 or by 
> Spark 3 with legacy rebase mode.
> This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2):
> {code:title=Good (Corrected Mode)}
> >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", 
> >>> "CORRECTED")
> >>> spark.sql("SELECT DATE '0001-01-01' AS 
> >>> date").write.mode("overwrite").parquet("date_written_by_spark3_corrected")
> >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", 
> >>> "date = '0001-01-01'").show()
> +--+---+
> |  date|(date = 0001-01-01)|
> +--+---+
> |0001-01-01|   true|
> +--+---+
> >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = 
> >>> '0001-01-01'").show()
> +--+
> |  date|
> +--+
> |0001-01-01|
> +--+
> {code}
> This is how we get incorrect results in _legacy_ mode; in this case the 
> filter drops rows it shouldn't:
> {code:title=Bad (Legacy Mode)}
> In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", 
> "LEGACY")
> >>> spark.sql("SELECT DATE '0001-01-01' AS 
> >>> date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
> >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", 
> >>> "date = '0001-01-01'").show()
> +--+---+
> |  date|(date = 0001-01-01)|
> +--+---+
> |0001-01-01|   true|
> +--+---+
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = 
> >>> '0001-01-01'").show()
> ++
> |date|
> ++
> ++
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = 
> >>> '0001-01-01'").explain()
> == Physical Plan ==
> *(1) Filter (isnotnull(date#154) AND (date#154 = -719162))
> +- *(1) ColumnarToRow
>+- FileScan parquet [date#154] Batched: true, DataFilters: 
> [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(date), 
> EqualTo(date,0001-01-01)], ReadSchema: struct
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36164) Change run-test.py so that it does not fail when os.environ["APACHE_SPARK_REF"] is not defined.

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381568#comment-17381568
 ] 

Apache Spark commented on SPARK-36164:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/33374

> Change run-test.py so that it does not fail when 
> os.environ["APACHE_SPARK_REF"] is not defined. 
> 
>
> Key: SPARK-36164
> URL: https://issues.apache.org/jira/browse/SPARK-36164
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36164) Change run-test.py so that it does not fail when os.environ["APACHE_SPARK_REF"] is not defined.

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381567#comment-17381567
 ] 

Apache Spark commented on SPARK-36164:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/33374

> Change run-test.py so that it does not fail when 
> os.environ["APACHE_SPARK_REF"] is not defined. 
> 
>
> Key: SPARK-36164
> URL: https://issues.apache.org/jira/browse/SPARK-36164
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode

2021-07-15 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-36034:


Assignee: Max Gekk

> Incorrect datetime filter when reading Parquet files written in legacy mode
> ---
>
> Key: SPARK-36034
> URL: https://issues.apache.org/jira/browse/SPARK-36034
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Willi Raschkowski
>Assignee: Max Gekk
>Priority: Blocker
>  Labels: correctness
>
> We're seeing incorrect date filters on Parquet files written by Spark 2 or by 
> Spark 3 with legacy rebase mode.
> This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2):
> {code:title=Good (Corrected Mode)}
> >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", 
> >>> "CORRECTED")
> >>> spark.sql("SELECT DATE '0001-01-01' AS 
> >>> date").write.mode("overwrite").parquet("date_written_by_spark3_corrected")
> >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", 
> >>> "date = '0001-01-01'").show()
> +--+---+
> |  date|(date = 0001-01-01)|
> +--+---+
> |0001-01-01|   true|
> +--+---+
> >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = 
> >>> '0001-01-01'").show()
> +--+
> |  date|
> +--+
> |0001-01-01|
> +--+
> {code}
> This is how we get incorrect results in _legacy_ mode; in this case the 
> filter drops rows it shouldn't:
> {code:title=Bad (Legacy Mode)}
> In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", 
> "LEGACY")
> >>> spark.sql("SELECT DATE '0001-01-01' AS 
> >>> date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
> >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", 
> >>> "date = '0001-01-01'").show()
> +--+---+
> |  date|(date = 0001-01-01)|
> +--+---+
> |0001-01-01|   true|
> +--+---+
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = 
> >>> '0001-01-01'").show()
> ++
> |date|
> ++
> ++
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = 
> >>> '0001-01-01'").explain()
> == Physical Plan ==
> *(1) Filter (isnotnull(date#154) AND (date#154 = -719162))
> +- *(1) ColumnarToRow
>+- FileScan parquet [date#154] Batched: true, DataFilters: 
> [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(date), 
> EqualTo(date,0001-01-01)], ReadSchema: struct
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode

2021-07-15 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-36034:
-
Fix Version/s: 3.3.0
   3.2.0

> Incorrect datetime filter when reading Parquet files written in legacy mode
> ---
>
> Key: SPARK-36034
> URL: https://issues.apache.org/jira/browse/SPARK-36034
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Willi Raschkowski
>Assignee: Max Gekk
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.2.0, 3.3.0
>
>
> We're seeing incorrect date filters on Parquet files written by Spark 2 or by 
> Spark 3 with legacy rebase mode.
> This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2):
> {code:title=Good (Corrected Mode)}
> >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", 
> >>> "CORRECTED")
> >>> spark.sql("SELECT DATE '0001-01-01' AS 
> >>> date").write.mode("overwrite").parquet("date_written_by_spark3_corrected")
> >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", 
> >>> "date = '0001-01-01'").show()
> +--+---+
> |  date|(date = 0001-01-01)|
> +--+---+
> |0001-01-01|   true|
> +--+---+
> >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = 
> >>> '0001-01-01'").show()
> +--+
> |  date|
> +--+
> |0001-01-01|
> +--+
> {code}
> This is how we get incorrect results in _legacy_ mode; in this case the 
> filter drops rows it shouldn't:
> {code:title=Bad (Legacy Mode)}
> In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", 
> "LEGACY")
> >>> spark.sql("SELECT DATE '0001-01-01' AS 
> >>> date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
> >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", 
> >>> "date = '0001-01-01'").show()
> +--+---+
> |  date|(date = 0001-01-01)|
> +--+---+
> |0001-01-01|   true|
> +--+---+
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = 
> >>> '0001-01-01'").show()
> ++
> |date|
> ++
> ++
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = 
> >>> '0001-01-01'").explain()
> == Physical Plan ==
> *(1) Filter (isnotnull(date#154) AND (date#154 = -719162))
> +- *(1) ColumnarToRow
>+- FileScan parquet [date#154] Batched: true, DataFilters: 
> [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(date), 
> EqualTo(date,0001-01-01)], ReadSchema: struct
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36034) Incorrect datetime filter when reading Parquet files written in legacy mode

2021-07-15 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-36034.
--
Resolution: Fixed

> Incorrect datetime filter when reading Parquet files written in legacy mode
> ---
>
> Key: SPARK-36034
> URL: https://issues.apache.org/jira/browse/SPARK-36034
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Willi Raschkowski
>Assignee: Max Gekk
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.2.0, 3.3.0
>
>
> We're seeing incorrect date filters on Parquet files written by Spark 2 or by 
> Spark 3 with legacy rebase mode.
> This is the expected behavior that we see in _corrected_ mode (Spark 3.1.2):
> {code:title=Good (Corrected Mode)}
> >>> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", 
> >>> "CORRECTED")
> >>> spark.sql("SELECT DATE '0001-01-01' AS 
> >>> date").write.mode("overwrite").parquet("date_written_by_spark3_corrected")
> >>> spark.read.parquet("date_written_by_spark3_corrected").selectExpr("date", 
> >>> "date = '0001-01-01'").show()
> +--+---+
> |  date|(date = 0001-01-01)|
> +--+---+
> |0001-01-01|   true|
> +--+---+
> >>> spark.read.parquet("date_written_by_spark3_corrected").where("date = 
> >>> '0001-01-01'").show()
> +--+
> |  date|
> +--+
> |0001-01-01|
> +--+
> {code}
> This is how we get incorrect results in _legacy_ mode; in this case the 
> filter drops rows it shouldn't:
> {code:title=Bad (Legacy Mode)}
> In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", 
> "LEGACY")
> >>> spark.sql("SELECT DATE '0001-01-01' AS 
> >>> date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
> >>> spark.read.parquet("date_written_by_spark3_legacy").selectExpr("date", 
> >>> "date = '0001-01-01'").show()
> +--+---+
> |  date|(date = 0001-01-01)|
> +--+---+
> |0001-01-01|   true|
> +--+---+
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = 
> >>> '0001-01-01'").show()
> ++
> |date|
> ++
> ++
> >>> spark.read.parquet("date_written_by_spark3_legacy").where("date = 
> >>> '0001-01-01'").explain()
> == Physical Plan ==
> *(1) Filter (isnotnull(date#154) AND (date#154 = -719162))
> +- *(1) ColumnarToRow
>+- FileScan parquet [date#154] Batched: true, DataFilters: 
> [isnotnull(date#154), (date#154 = -719162)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Volumes/git/spark-installs/spark-3.1.2-bin-hadoop3.2/date_written_by_spar...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(date), 
> EqualTo(date,0001-01-01)], ReadSchema: struct
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36127) Support comparison between a Categorical and a scalar

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36127:


Assignee: Apache Spark

> Support comparison between a Categorical and a scalar
> -
>
> Key: SPARK-36127
> URL: https://issues.apache.org/jira/browse/SPARK-36127
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36127) Support comparison between a Categorical and a scalar

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36127:


Assignee: (was: Apache Spark)

> Support comparison between a Categorical and a scalar
> -
>
> Key: SPARK-36127
> URL: https://issues.apache.org/jira/browse/SPARK-36127
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36127) Support comparison between a Categorical and a scalar

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36127:


Assignee: Apache Spark

> Support comparison between a Categorical and a scalar
> -
>
> Key: SPARK-36127
> URL: https://issues.apache.org/jira/browse/SPARK-36127
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36127) Support comparison between a Categorical and a scalar

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381550#comment-17381550
 ] 

Apache Spark commented on SPARK-36127:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/33373

> Support comparison between a Categorical and a scalar
> -
>
> Key: SPARK-36127
> URL: https://issues.apache.org/jira/browse/SPARK-36127
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36164) Change run-test.py so that it does not fail when os.environ["APACHE_SPARK_REF"] is not defined.

2021-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-36164.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33371
[https://github.com/apache/spark/pull/33371]

> Change run-test.py so that it does not fail when 
> os.environ["APACHE_SPARK_REF"] is not defined. 
> 
>
> Key: SPARK-36164
> URL: https://issues.apache.org/jira/browse/SPARK-36164
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36164) Change run-test.py so that it does not fail when os.environ["APACHE_SPARK_REF"] is not defined.

2021-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-36164:
-

Assignee: William Hyun

> Change run-test.py so that it does not fail when 
> os.environ["APACHE_SPARK_REF"] is not defined. 
> 
>
> Key: SPARK-36164
> URL: https://issues.apache.org/jira/browse/SPARK-36164
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36165) Fix SQL doc generation in GitHub Action

2021-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-36165.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33372
[https://github.com/apache/spark/pull/33372]

> Fix SQL doc generation in GitHub Action
> ---
>
> Key: SPARK-36165
> URL: https://issues.apache.org/jira/browse/SPARK-36165
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36165) Fix SQL doc generation in GitHub Action

2021-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-36165:
-

Assignee: Dongjoon Hyun

> Fix SQL doc generation in GitHub Action
> ---
>
> Key: SPARK-36165
> URL: https://issues.apache.org/jira/browse/SPARK-36165
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36127) Support comparison between a Categorical and a scalar

2021-07-15 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-36127:
-
Summary: Support comparison between a Categorical and a scalar  (was: 
Adjust non-equality comparison operators to accept scalar)
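
For context, the operation being enabled looks roughly like the sketch below (pandas-on-Spark, an ordered categorical compared against a scalar category; illustrative only, the exact semantics follow pandas):

{code:python}
import pyspark.pandas as ps
from pandas import CategoricalDtype

# Illustrative sketch: a non-equality comparison (<, <=, >, >=) between an
# ordered Categorical series and a scalar category, by category order.
dtype = CategoricalDtype(categories=["low", "mid", "high"], ordered=True)
s = ps.Series(["low", "high", "mid"]).astype(dtype)

print((s > "low").to_pandas())
{code}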

> Support comparison between a Categorical and a scalar
> -
>
> Key: SPARK-36127
> URL: https://issues.apache.org/jira/browse/SPARK-36127
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36165) Fix SQL doc generation in GitHub Action

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36165:


Assignee: Apache Spark

> Fix SQL doc generation in GitHub Action
> ---
>
> Key: SPARK-36165
> URL: https://issues.apache.org/jira/browse/SPARK-36165
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36165) Fix SQL doc generation in GitHub Action

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381494#comment-17381494
 ] 

Apache Spark commented on SPARK-36165:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/33372

> Fix SQL doc generation in GitHub Action
> ---
>
> Key: SPARK-36165
> URL: https://issues.apache.org/jira/browse/SPARK-36165
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36165) Fix SQL doc generation in GitHub Action

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36165:


Assignee: (was: Apache Spark)

> Fix SQL doc generation in GitHub Action
> ---
>
> Key: SPARK-36165
> URL: https://issues.apache.org/jira/browse/SPARK-36165
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36165) Fix SQL doc generation in GitHub Action

2021-07-15 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-36165:
-

 Summary: Fix SQL doc generation in GitHub Action
 Key: SPARK-36165
 URL: https://issues.apache.org/jira/browse/SPARK-36165
 Project: Spark
  Issue Type: Test
  Components: Project Infra
Affects Versions: 3.3.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36164) Change run-test.py so that it does not fail when os.environ["APACHE_SPARK_REF"] is not defined.

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381476#comment-17381476
 ] 

Apache Spark commented on SPARK-36164:
--

User 'williamhyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/33371

> Change run-test.py so that it does not fail when 
> os.environ["APACHE_SPARK_REF"] is not defined. 
> 
>
> Key: SPARK-36164
> URL: https://issues.apache.org/jira/browse/SPARK-36164
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36164) Change run-test.py so that it does not fail when os.environ["APACHE_SPARK_REF"] is not defined.

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36164:


Assignee: Apache Spark

> Change run-test.py so that it does not fail when 
> os.environ["APACHE_SPARK_REF"] is not defined. 
> 
>
> Key: SPARK-36164
> URL: https://issues.apache.org/jira/browse/SPARK-36164
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36164) Change run-test.py so that it does not fail when os.environ["APACHE_SPARK_REF"] is not defined.

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381477#comment-17381477
 ] 

Apache Spark commented on SPARK-36164:
--

User 'williamhyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/33371

> Change run-test.py so that it does not fail when 
> os.environ["APACHE_SPARK_REF"] is not defined. 
> 
>
> Key: SPARK-36164
> URL: https://issues.apache.org/jira/browse/SPARK-36164
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36164) Change run-test.py so that it does not fail when os.environ["APACHE_SPARK_REF"] is not defined.

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36164:


Assignee: (was: Apache Spark)

> Change run-test.py so that it does not fail when 
> os.environ["APACHE_SPARK_REF"] is not defined. 
> 
>
> Key: SPARK-36164
> URL: https://issues.apache.org/jira/browse/SPARK-36164
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36135) Support TimestampNTZ type in file partitioning

2021-07-15 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-36135.

Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33344
[https://github.com/apache/spark/pull/33344]

> Support TimestampNTZ type in file partitioning
> --
>
> Key: SPARK-36135
> URL: https://issues.apache.org/jira/browse/SPARK-36135
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> When the default Timestamp type is TimestampNTZ, Spark should parse 
> timestamp partition values as TimestampNTZ.
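
A rough sketch of the behaviour in question, assuming the session default is switched via {{spark.sql.timestampType}} (the config name, literal, and path below are assumptions for illustration):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Assumption: spark.sql.timestampType selects the default timestamp type.
spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")

df = spark.sql("SELECT 1 AS id, TIMESTAMP'2021-07-15 12:00:00' AS ts")
df.write.mode("overwrite").partitionBy("ts").parquet("/tmp/ntz_partitioned")

# With this change, the inferred partition column 'ts' should come back as
# TimestampNTZType instead of the session-time-zone TimestampType.
spark.read.parquet("/tmp/ntz_partitioned").printSchema()
{code}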



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36164) Change run-test.py so that it does not fail when os.environ["APACHE_SPARK_REF"] is not defined.

2021-07-15 Thread William Hyun (Jira)
William Hyun created SPARK-36164:


 Summary: Change run-test.py so that it does not fail when 
os.environ["APACHE_SPARK_REF"] is not defined. 
 Key: SPARK-36164
 URL: https://issues.apache.org/jira/browse/SPARK-36164
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.3.0
Reporter: William Hyun
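
A minimal sketch of the kind of defensive lookup the summary describes (illustrative only, not necessarily the actual change):

{code:python}
import os

# os.environ["APACHE_SPARK_REF"] raises KeyError when the variable is unset,
# which aborts the script. Reading it with a default avoids the failure.
apache_spark_ref = os.environ.get("APACHE_SPARK_REF", "")

if apache_spark_ref:
    print(f"Comparing against base ref: {apache_spark_ref}")
else:
    print("APACHE_SPARK_REF is not set; skipping the ref-based comparison.")
{code}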






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36157) TimeWindow expression: apply filter before project

2021-07-15 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-36157:
---

Assignee: Jungtaek Lim

> TimeWindow expression: apply filter before project
> --
>
> Key: SPARK-36157
> URL: https://issues.apache.org/jira/browse/SPARK-36157
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
>
> https://github.com/apache/spark/blob/4dfd266b27fea6954593c6b9e3a2819b290f0aec/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3901-L3906
> In the case of a tumbling window, we apply project and then filter, even though 
> the filter does not depend on the project. We can simply swap the two operators 
> so that fewer rows are projected when some rows are filtered out.
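
As a DataFrame-level analogy of the same idea (illustrative only; the actual change is in the Analyzer's time-window rewrite), filtering before projecting lets fewer rows reach the projection when the filter does not use the projected expressions:

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# Project then filter: every row is projected before any row is discarded.
project_then_filter = df.select((F.col("id") * 2).alias("doubled"), "id") \
    .filter(F.col("id") > 5)

# Filter then project: rows are discarded first, so the projection runs on
# fewer rows; the two forms are equivalent because the filter only uses "id".
filter_then_project = df.filter(F.col("id") > 5) \
    .select((F.col("id") * 2).alias("doubled"), "id")

print(project_then_filter.count(), filter_then_project.count())
{code}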



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36157) TimeWindow expression: apply filter before project

2021-07-15 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-36157.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33367
[https://github.com/apache/spark/pull/33367]

> TimeWindow expression: apply filter before project
> --
>
> Key: SPARK-36157
> URL: https://issues.apache.org/jira/browse/SPARK-36157
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
> Fix For: 3.3.0
>
>
> https://github.com/apache/spark/blob/4dfd266b27fea6954593c6b9e3a2819b290f0aec/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3901-L3906
> In the case of a tumbling window, we apply project and then filter, even though 
> the filter does not depend on the project. We can simply swap the two operators 
> so that fewer rows are projected when some rows are filtered out.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36158) pyspark sql/functions documentation for months_between isn't as precise as scala version

2021-07-15 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-36158.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33366
[https://github.com/apache/spark/pull/33366]

> pyspark sql/functions documentation for months_between isn't as precise as 
> scala version
> 
>
> Key: SPARK-36158
> URL: https://issues.apache.org/jira/browse/SPARK-36158
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.1.2
>Reporter: Dominik Gehl
>Priority: Trivial
> Fix For: 3.3.0
>
>
> The pyspark months_between documentation doesn't mention that months are assumed 
> to have 31 days in the calculation.
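
For illustration, a small sketch of the function in question (the whole-month part is exact; the remaining fraction is computed assuming 31-day months, which is the detail the issue says the Python docstring leaves out):

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("1997-02-28 10:30:00", "1996-10-30")], ["date1", "date2"])

# The fractional part of the result is based on a 31-day month assumption.
df.select(F.months_between("date1", "date2").alias("months")).show()
{code}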



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36134) jackson-databind RCE vulnerability

2021-07-15 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381417#comment-17381417
 ] 

Erik Krogen commented on SPARK-36134:
-

3.1.2 doesn't exist yet; the only release in the 3.1 line is 3.1.1. How are you 
using 3.1.2?

Regardless, the jackson-databind version used by 3.1.1 (and the 3.1 line 
generally) is 2.10.0: 
https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/pom.xml#L172

> jackson-databind RCE vulnerability
> --
>
> Key: SPARK-36134
> URL: https://issues.apache.org/jira/browse/SPARK-36134
> Project: Spark
>  Issue Type: Task
>  Components: Java API
>Affects Versions: 3.1.2, 3.1.3
>Reporter: Sumit
>Priority: Major
> Attachments: Screenshot 2021-07-15 at 1.00.55 PM.png
>
>
> Need to upgrade the jackson-databind version to *2.9.3.1*.
> At the beginning of 2018, jackson-databind was reported to contain another 
> remote code execution (RCE) vulnerability (CVE-2017-17485) that affects 
> versions 2.9.3 and earlier, 2.7.9.1 and earlier, and 2.8.10 and earlier. This 
> vulnerability is caused by jackson-databind’s incomplete blacklist. An 
> application that uses jackson-databind will become vulnerable when the 
> enableDefaultTyping method is called via the ObjectMapper object within the 
> application. An attacker can thus compromise the application by sending 
> maliciously crafted JSON input to gain direct control over a server. 
> Currently, a proof-of-concept (POC) exploit for this vulnerability is 
> publicly available. All users who are affected by this vulnerability should 
> upgrade to the latest versions as soon as possible to fix this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36146) Upgrade Python version from 3.6 to higher version in GitHub linter

2021-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-36146:
-

Assignee: Hyukjin Kwon

> Upgrade Python version from 3.6 to higher version in GitHub linter
> --
>
> Key: SPARK-36146
> URL: https://issues.apache.org/jira/browse/SPARK-36146
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> MyPy checks fail with higher Python versions. For example, with Python 3.8:
> {code}
> python/pyspark/sql/pandas/_typing/protocols/frame.pyi:64: error: Name 
> "np.ndarray" is not defined
> python/pyspark/sql/pandas/_typing/protocols/frame.pyi:91: error: Name 
> "np.recarray" is not defined
> python/pyspark/sql/pandas/_typing/protocols/frame.pyi:165: error: Name 
> "np.ndarray" is not defined
> python/pyspark/pandas/categorical.py:82: error: Item "dtype[Any]" of 
> "Union[dtype[Any], Any]" has no attribute "categories"
> python/pyspark/pandas/categorical.py:109: error: Item "dtype[Any]" of 
> "Union[dtype[Any], Any]" has no attribute "ordered"
> python/pyspark/ml/linalg/__init__.pyi:184: error: Return type "ndarray[Any, 
> Any]" of "toArray" incompatible with return type "NoReturn" in supertype 
> "Matrix"
> python/pyspark/ml/linalg/__init__.pyi:217: error: Return type "ndarray[Any, 
> Any]" of "toArray" incompatible with return type "NoReturn" in supertype 
> "Matrix"
> python/pyspark/pandas/typedef/typehints.py:163: error: Module has no 
> attribute "bool"; maybe "bool_" or "bool8"?
> python/pyspark/pandas/typedef/typehints.py:174: error: Module has no 
> attribute "float"; maybe "float_", "cfloat", or "float96"?
> python/pyspark/pandas/typedef/typehints.py:180: error: Module has no 
> attribute "int"; maybe "uint", "rint", or "intp"?
> python/pyspark/pandas/ml.py:81: error: Value of type variable 
> "_DTypeScalar_co" of "dtype" cannot be "object"
> python/pyspark/pandas/indexing.py:1649: error: Module has no attribute "int"; 
> maybe "uint", "rint", or "intp"?
> python/pyspark/pandas/indexing.py:1656: error: Module has no attribute "int"; 
> maybe "uint", "rint", or "intp"?
> python/pyspark/pandas/frame.py:4969: error: Function "numpy.array" is not 
> valid as a type
> python/pyspark/pandas/frame.py:4969: note: Perhaps you need "Callable[...]" 
> or a callback protocol?
> python/pyspark/pandas/frame.py:4970: error: Function "numpy.array" is not 
> valid as a type
> python/pyspark/pandas/frame.py:4970: note: Perhaps you need "Callable[...]" 
> or a callback protocol?
> python/pyspark/pandas/frame.py:7402: error: "List[Any]" has no attribute 
> "tolist"
> python/pyspark/pandas/series.py:1030: error: Module has no attribute 
> "_NoValue"
> python/pyspark/pandas/series.py:1031: error: Module has no attribute 
> "_NoValue"
> python/pyspark/pandas/indexes/category.py:159: error: Item "dtype[Any]" of 
> "Union[dtype[Any], Any]" has no attribute "categories"
> python/pyspark/pandas/indexes/category.py:180: error: Item "dtype[Any]" of 
> "Union[dtype[Any], Any]" has no attribute "ordered"
> python/pyspark/pandas/namespace.py:2036: error: Argument 1 to "column_name" 
> has incompatible type "float"; expected "str"
> python/pyspark/pandas/mlflow.py:59: error: Incompatible types in assignment 
> (expression has type "Type[floating[Any]]", variable has type "str")
> python/pyspark/pandas/data_type_ops/categorical_ops.py:43: error: Item 
> "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
> python/pyspark/pandas/data_type_ops/categorical_ops.py:43: error: Item 
> "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
> python/pyspark/pandas/data_type_ops/categorical_ops.py:56: error: Item 
> "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
> python/pyspark/pandas/tests/test_typedef.py:70: error: Name "np.float" is not 
> defined
> python/pyspark/pandas/tests/test_typedef.py:77: error: Name "np.float" is not 
> defined
> python/pyspark/pandas/tests/test_typedef.py:85: error: Name "np.float" is not 
> defined
> python/pyspark/pandas/tests/test_typedef.py:100: error: Name "np.float" is 
> not defined
> python/pyspark/pandas/tests/test_typedef.py:108: error: Name "np.float" is 
> not defined
> python/pyspark/mllib/clustering.pyi:152: error: Incompatible types in 
> assignment (expression has type "ndarray[Any, Any]", base class "KMeansModel" 
> defined the type as "List[ndarray[Any, Any]]")
> python/pyspark/mllib/classification.pyi:93: error: Signature of "predict" 
> incompatible with supertype "LinearClassificationModel"
> Found 32 errors in 15 files (checked 315 source files)
> 1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (SPARK-36146) Upgrade Python version from 3.6 to higher version in GitHub linter

2021-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-36146.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33356
[https://github.com/apache/spark/pull/33356]

> Upgrade Python version from 3.6 to higher version in GitHub linter
> --
>
> Key: SPARK-36146
> URL: https://issues.apache.org/jira/browse/SPARK-36146
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra, PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.3.0
>
>
> MyPy checks fail with higher Python versions. For example, with Python 3.8:
> {code}
> python/pyspark/sql/pandas/_typing/protocols/frame.pyi:64: error: Name 
> "np.ndarray" is not defined
> python/pyspark/sql/pandas/_typing/protocols/frame.pyi:91: error: Name 
> "np.recarray" is not defined
> python/pyspark/sql/pandas/_typing/protocols/frame.pyi:165: error: Name 
> "np.ndarray" is not defined
> python/pyspark/pandas/categorical.py:82: error: Item "dtype[Any]" of 
> "Union[dtype[Any], Any]" has no attribute "categories"
> python/pyspark/pandas/categorical.py:109: error: Item "dtype[Any]" of 
> "Union[dtype[Any], Any]" has no attribute "ordered"
> python/pyspark/ml/linalg/__init__.pyi:184: error: Return type "ndarray[Any, 
> Any]" of "toArray" incompatible with return type "NoReturn" in supertype 
> "Matrix"
> python/pyspark/ml/linalg/__init__.pyi:217: error: Return type "ndarray[Any, 
> Any]" of "toArray" incompatible with return type "NoReturn" in supertype 
> "Matrix"
> python/pyspark/pandas/typedef/typehints.py:163: error: Module has no 
> attribute "bool"; maybe "bool_" or "bool8"?
> python/pyspark/pandas/typedef/typehints.py:174: error: Module has no 
> attribute "float"; maybe "float_", "cfloat", or "float96"?
> python/pyspark/pandas/typedef/typehints.py:180: error: Module has no 
> attribute "int"; maybe "uint", "rint", or "intp"?
> python/pyspark/pandas/ml.py:81: error: Value of type variable 
> "_DTypeScalar_co" of "dtype" cannot be "object"
> python/pyspark/pandas/indexing.py:1649: error: Module has no attribute "int"; 
> maybe "uint", "rint", or "intp"?
> python/pyspark/pandas/indexing.py:1656: error: Module has no attribute "int"; 
> maybe "uint", "rint", or "intp"?
> python/pyspark/pandas/frame.py:4969: error: Function "numpy.array" is not 
> valid as a type
> python/pyspark/pandas/frame.py:4969: note: Perhaps you need "Callable[...]" 
> or a callback protocol?
> python/pyspark/pandas/frame.py:4970: error: Function "numpy.array" is not 
> valid as a type
> python/pyspark/pandas/frame.py:4970: note: Perhaps you need "Callable[...]" 
> or a callback protocol?
> python/pyspark/pandas/frame.py:7402: error: "List[Any]" has no attribute 
> "tolist"
> python/pyspark/pandas/series.py:1030: error: Module has no attribute 
> "_NoValue"
> python/pyspark/pandas/series.py:1031: error: Module has no attribute 
> "_NoValue"
> python/pyspark/pandas/indexes/category.py:159: error: Item "dtype[Any]" of 
> "Union[dtype[Any], Any]" has no attribute "categories"
> python/pyspark/pandas/indexes/category.py:180: error: Item "dtype[Any]" of 
> "Union[dtype[Any], Any]" has no attribute "ordered"
> python/pyspark/pandas/namespace.py:2036: error: Argument 1 to "column_name" 
> has incompatible type "float"; expected "str"
> python/pyspark/pandas/mlflow.py:59: error: Incompatible types in assignment 
> (expression has type "Type[floating[Any]]", variable has type "str")
> python/pyspark/pandas/data_type_ops/categorical_ops.py:43: error: Item 
> "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
> python/pyspark/pandas/data_type_ops/categorical_ops.py:43: error: Item 
> "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
> python/pyspark/pandas/data_type_ops/categorical_ops.py:56: error: Item 
> "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
> python/pyspark/pandas/tests/test_typedef.py:70: error: Name "np.float" is not 
> defined
> python/pyspark/pandas/tests/test_typedef.py:77: error: Name "np.float" is not 
> defined
> python/pyspark/pandas/tests/test_typedef.py:85: error: Name "np.float" is not 
> defined
> python/pyspark/pandas/tests/test_typedef.py:100: error: Name "np.float" is 
> not defined
> python/pyspark/pandas/tests/test_typedef.py:108: error: Name "np.float" is 
> not defined
> python/pyspark/mllib/clustering.pyi:152: error: Incompatible types in 
> assignment (expression has type "ndarray[Any, Any]", base class "KMeansModel" 
> defined the type as "List[ndarray[Any, Any]]")
> python/pyspark/mllib/classification.pyi:93: error: Signature of "predict" 
> incompatible with supertype "LinearClassificationModel"
> Found 32 errors in 15 files (checked 315 source files)
> {code}

[jira] [Resolved] (SPARK-36159) Replace 'python' to 'python3' in dev/test-dependencies.sh

2021-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-36159.
---
Fix Version/s: 3.1.3
   3.2.0
   3.0.4
   Resolution: Fixed

Issue resolved by pull request 33368
[https://github.com/apache/spark/pull/33368]

> Replace 'python' to 'python3' in dev/test-dependencies.sh 
> --
>
> Key: SPARK-36159
> URL: https://issues.apache.org/jira/browse/SPARK-36159
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 3.0.4, 3.2.0, 3.1.3
>
>
> There's one last place to change python to python3 in 
> dev/test-dependencies.sh. This is a follow-up of SPARK-29672.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36159) Replace 'python' to 'python3' in dev/test-dependencies.sh

2021-07-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-36159:
-

Assignee: Hyukjin Kwon

> Replace 'python' to 'python3' in dev/test-dependencies.sh 
> --
>
> Key: SPARK-36159
> URL: https://issues.apache.org/jira/browse/SPARK-36159
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
>
> There's one last place to change python to python3 in 
> dev/test-dependencies.sh. This is a follow-up of SPARK-29672.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36163) Propagate correct JDBC properties in JDBC connector provider and add "connectionProvider" option

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381389#comment-17381389
 ] 

Apache Spark commented on SPARK-36163:
--

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/33370

> Propagate correct JDBC properties in JDBC connector provider and add 
> "connectionProvider" option
> 
>
> Key: SPARK-36163
> URL: https://issues.apache.org/jira/browse/SPARK-36163
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2
>Reporter: Ivan
>Priority: Major
>
> There are a couple of issues with JDBC connection providers. The first is a 
> bug caused by 
> [https://github.com/apache/spark/commit/c3ce9701b458511255072c72b9b245036fa98653]
>  where we would pass all properties, including JDBC data source keys, to the 
> JDBC driver which results in errors like {{java.sql.SQLException: 
> Unrecognized connection property 'url'}}.
> Connection properties are supposed to only include vendor properties; the url 
> config is a JDBC option and should be excluded.
> The fix would be replacing {{jdbcOptions.asProperties.asScala.foreach}} with 
> {{jdbcOptions.asConnectionProperties.asScala.foreach}} which is 
> java.sql.Driver friendly.
>  
> I also investigated the problem with multiple providers and I think there are 
> a couple of oversights in {{ConnectionProvider}} implementation. I think it 
> is missing two things:
>  * Any {{JdbcConnectionProvider}} should take precedence over 
> {{BasicConnectionProvider}}. {{BasicConnectionProvider}} should only be 
> selected if there was no match found when inferring providers that can handle 
> JDBC url.
>  * There is currently no way to select a specific provider that you want, 
> similar to how you can select a JDBC driver. The use case is, for example, 
> having connection providers for two databases that handle the same URL but 
> have slightly different semantics and you want to select one in one case and 
> the other one in others.
>  ** I think the first point could be discarded when the second one is 
> addressed.
> You can technically use {{spark.sql.sources.disabledJdbcConnProviderList}} to 
> exclude ones that don’t need to be included, but I am not quite sure why it 
> was done that way - it is much simpler to allow users to enforce the provider 
> they want.
> This ticket fixes it by adding a {{connectionProvider}} option to the JDBC 
> data source that allows users to select a particular provider when the 
> ambiguity arises.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36163) Propagate correct JDBC properties in JDBC connector provider and add "connectionProvider" option

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36163:


Assignee: (was: Apache Spark)

> Propagate correct JDBC properties in JDBC connector provider and add 
> "connectionProvider" option
> 
>
> Key: SPARK-36163
> URL: https://issues.apache.org/jira/browse/SPARK-36163
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2
>Reporter: Ivan
>Priority: Major
>
> There are a couple of issues with JDBC connection providers. The first is a 
> bug caused by 
> [https://github.com/apache/spark/commit/c3ce9701b458511255072c72b9b245036fa98653]
>  where we would pass all properties, including JDBC data source keys, to the 
> JDBC driver which results in errors like {{java.sql.SQLException: 
> Unrecognized connection property 'url'}}.
> Connection properties are supposed to only include vendor properties; the url 
> config is a JDBC option and should be excluded.
> The fix would be replacing {{jdbcOptions.asProperties.asScala.foreach}} with 
> {{jdbcOptions.asConnectionProperties.asScala.foreach}} which is 
> java.sql.Driver friendly.
>  
> I also investigated the problem with multiple providers and I think there are 
> a couple of oversights in {{ConnectionProvider}} implementation. I think it 
> is missing two things:
>  * Any {{JdbcConnectionProvider}} should take precedence over 
> {{BasicConnectionProvider}}. {{BasicConnectionProvider}} should only be 
> selected if there was no match found when inferring providers that can handle 
> JDBC url.
>  * There is currently no way to select a specific provider that you want, 
> similar to how you can select a JDBC driver. The use case is, for example, 
> having connection providers for two databases that handle the same URL but 
> have slightly different semantics and you want to select one in one case and 
> the other one in others.
>  ** I think the first point could be discarded when the second one is 
> addressed.
> You can technically use {{spark.sql.sources.disabledJdbcConnProviderList}} to 
> exclude ones that don’t need to be included, but I am not quite sure why it 
> was done that way - it is much simpler to allow users to enforce the provider 
> they want.
> This ticket fixes it by adding a {{connectionProvider}} option to the JDBC 
> data source that allows users to select a particular provider when the 
> ambiguity arises.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36163) Propagate correct JDBC properties in JDBC connector provider and add "connectionProvider" option

2021-07-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36163:


Assignee: Apache Spark

> Propagate correct JDBC properties in JDBC connector provider and add 
> "connectionProvider" option
> 
>
> Key: SPARK-36163
> URL: https://issues.apache.org/jira/browse/SPARK-36163
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2
>Reporter: Ivan
>Assignee: Apache Spark
>Priority: Major
>
> There are a couple of issues with JDBC connection providers. The first is a 
> bug caused by 
> [https://github.com/apache/spark/commit/c3ce9701b458511255072c72b9b245036fa98653]
>  where we would pass all properties, including JDBC data source keys, to the 
> JDBC driver which results in errors like {{java.sql.SQLException: 
> Unrecognized connection property 'url'}}.
> Connection properties are supposed to only include vendor properties; the url 
> config is a JDBC option and should be excluded.
> The fix would be replacing {{jdbcOptions.asProperties.asScala.foreach}} with 
> {{jdbcOptions.asConnectionProperties.asScala.foreach}} which is 
> java.sql.Driver friendly.
>  
> I also investigated the problem with multiple providers and I think there are 
> a couple of oversights in {{ConnectionProvider}} implementation. I think it 
> is missing two things:
>  * Any {{JdbcConnectionProvider}} should take precedence over 
> {{BasicConnectionProvider}}. {{BasicConnectionProvider}} should only be 
> selected if there was no match found when inferring providers that can handle 
> JDBC url.
>  * There is currently no way to select a specific provider that you want, 
> similar to how you can select a JDBC driver. The use case is, for example, 
> having connection providers for two databases that handle the same URL but 
> have slightly different semantics and you want to select one in one case and 
> the other one in others.
>  ** I think the first point could be discarded when the second one is 
> addressed.
> You can technically use {{spark.sql.sources.disabledJdbcConnProviderList}} to 
> exclude ones that don’t need to be included, but I am not quite sure why it 
> was done that way - it is much simpler to allow users to enforce the provider 
> they want.
> This ticket fixes it by adding a {{connectionProvider}} option to the JDBC 
> data source that allows users to select a particular provider when the 
> ambiguity arises.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36163) Propagate correct JDBC properties in JDBC connector provider and add "connectionProvider" option

2021-07-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381388#comment-17381388
 ] 

Apache Spark commented on SPARK-36163:
--

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/33370

> Propagate correct JDBC properties in JDBC connector provider and add 
> "connectionProvider" option
> 
>
> Key: SPARK-36163
> URL: https://issues.apache.org/jira/browse/SPARK-36163
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.1.1, 3.1.2
>Reporter: Ivan
>Priority: Major
>
> There are a couple of issues with JDBC connection providers. The first is a 
> bug caused by 
> [https://github.com/apache/spark/commit/c3ce9701b458511255072c72b9b245036fa98653]
>  where we would pass all properties, including JDBC data source keys, to the 
> JDBC driver which results in errors like {{java.sql.SQLException: 
> Unrecognized connection property 'url'}}.
> Connection properties are supposed to only include vendor properties; the url 
> config is a JDBC option and should be excluded.
> The fix would be replacing {{jdbcOptions.asProperties.asScala.foreach}} with 
> {{jdbcOptions.asConnectionProperties.asScala.foreach}} which is 
> java.sql.Driver friendly.
>  
> I also investigated the problem with multiple providers and I think there are 
> a couple of oversights in {{ConnectionProvider}} implementation. I think it 
> is missing two things:
>  * Any {{JdbcConnectionProvider}} should take precedence over 
> {{BasicConnectionProvider}}. {{BasicConnectionProvider}} should only be 
> selected if there was no match found when inferring providers that can handle 
> JDBC url.
>  * There is currently no way to select a specific provider that you want, 
> similar to how you can select a JDBC driver. The use case is, for example, 
> having connection providers for two databases that handle the same URL but 
> have slightly different semantics and you want to select one in one case and 
> the other one in others.
>  ** I think the first point could be discarded when the second one is 
> addressed.
> You can technically use {{spark.sql.sources.disabledJdbcConnProviderList}} to 
> exclude ones that don’t need to be included, but I am not quite sure why it 
> was done that way - it is much simpler to allow users to enforce the provider 
> they want.
> This ticket fixes it by adding a {{connectionProvider}} option to the JDBC 
> data source that allows users to select a particular provider when the 
> ambiguity arises.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36163) Propagate correct JDBC properties in JDBC connector provider and add "connectionProvider" option

2021-07-15 Thread Ivan (Jira)
Ivan created SPARK-36163:


 Summary: Propagate correct JDBC properties in JDBC connector 
provider and add "connectionProvider" option
 Key: SPARK-36163
 URL: https://issues.apache.org/jira/browse/SPARK-36163
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.2, 3.1.1, 3.1.0
Reporter: Ivan


There are a couple of issues with JDBC connection providers. The first is a bug 
caused by 
[https://github.com/apache/spark/commit/c3ce9701b458511255072c72b9b245036fa98653]
 where we would pass all properties, including JDBC data source keys, to the 
JDBC driver, which results in errors like {{java.sql.SQLException: Unrecognized 
connection property 'url'}}.

Connection properties are supposed to only include vendor properties; the url 
config is a JDBC option and should be excluded.

The fix would be replacing {{jdbcOptions.asProperties.asScala.foreach}} with 
{{jdbcOptions.asConnectionProperties.asScala.foreach}}, which is 
java.sql.Driver-friendly.

I also investigated the problem with multiple providers, and I think there are 
a couple of oversights in the {{ConnectionProvider}} implementation. I think it 
is missing two things:
 * Any {{JdbcConnectionProvider}} should take precedence over 
{{BasicConnectionProvider}}. {{BasicConnectionProvider}} should only be 
selected if no match is found when inferring providers that can handle the 
JDBC url.

 * There is currently no way to select the specific provider that you want, 
similar to how you can select a JDBC driver. The use case is, for example, 
having connection providers for two databases that handle the same URL but have 
slightly different semantics, where you want to select one provider in one case 
and the other in other cases.

 ** I think the first point could be discarded once the second one is addressed.

You can technically use {{spark.sql.sources.disabledJdbcConnProviderList}} to 
exclude the providers that are not needed, but I am not quite sure why it was 
done that way - it is much simpler to allow users to enforce the provider they 
want.

This ticket fixes it by adding a {{connectionProvider}} option to the JDBC data 
source that allows users to select a particular provider when the ambiguity 
arises.
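
(Editorial sketch, not from the ticket: how the proposed option might look from 
the user side. The option name {{connectionProvider}} is taken from this 
description and the provider name "basic" is only an assumed value; neither is 
a released API at the time of writing, and the URL/table names are 
placeholders.)

{code}
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/analytics")  // placeholder URL
  .option("dbtable", "public.events")                        // placeholder table
  .option("connectionProvider", "basic")  // proposed: pick the provider explicitly
  .load()
{code}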



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


