(spark) branch master updated: [SPARK-46473][SQL] Reuse `getPartitionedFile` method

2024-01-31 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 223afea9960c [SPARK-46473][SQL] Reuse `getPartitionedFile` method
223afea9960c is described below

commit 223afea9960c7ef1a4c8654e043e860f6c248185
Author: huangxiaoping <1754789...@qq.com>
AuthorDate: Wed Jan 31 22:59:20 2024 -0600

[SPARK-46473][SQL] Reuse `getPartitionedFile` method

### What changes were proposed in this pull request?
Reuse `getPartitionedFile` method to reduce redundant code.

### Why are the changes needed?
Reduce redundant code.
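
The consolidation keeps one helper that takes an explicit offset and length; a cleaned-up sketch of its shape, using the Spark-internal types from the patch below (callers that need the whole file pass `0` and `file.getLen`):

```
// Sketch of the reused helper (Spark-internal types; see the diff below).
def getPartitionedFile(
    file: FileStatusWithMetadata,
    partitionValues: InternalRow,
    start: Long,
    length: Long): PartitionedFile = {
  val hosts = getBlockHosts(getBlockLocations(file.fileStatus), start, length)
  PartitionedFile(partitionValues, SparkPath.fromPath(file.getPath), start, length, hosts,
    file.getModificationTime, file.getLen, file.metadata)
}
```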

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44437 from huangxiaopingRD/SPARK-46473.

Authored-by: huangxiaoping <1754789...@qq.com>
Signed-off-by: Sean Owen 
---
 .../apache/spark/sql/execution/DataSourceScanExec.scala|  2 +-
 .../apache/spark/sql/execution/PartitionedFileUtil.scala   | 14 +++---
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
index b3b2b0eab055..2622eadaefb3 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
@@ -645,7 +645,7 @@ case class FileSourceScanExec(
 logInfo(s"Planning with ${bucketSpec.numBuckets} buckets")
 val filesGroupedToBuckets =
   selectedPartitions.flatMap { p =>
-p.files.map(f => PartitionedFileUtil.getPartitionedFile(f, p.values))
+p.files.map(f => PartitionedFileUtil.getPartitionedFile(f, p.values, 0, f.getLen))
   }.groupBy { f =>
 BucketingUtils
   .getBucketId(f.toPath.getName)
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala
index b31369b6768e..997859058de1 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala
@@ -33,20 +33,20 @@ object PartitionedFileUtil {
   (0L until file.getLen by maxSplitBytes).map { offset =>
 val remaining = file.getLen - offset
 val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
-val hosts = getBlockHosts(getBlockLocations(file.fileStatus), offset, size)
-PartitionedFile(partitionValues, SparkPath.fromPath(file.getPath), offset, size, hosts,
-  file.getModificationTime, file.getLen, file.metadata)
+getPartitionedFile(file, partitionValues, offset, size)
   }
 } else {
-  Seq(getPartitionedFile(file, partitionValues))
+  Seq(getPartitionedFile(file, partitionValues, 0, file.getLen))
 }
   }
 
   def getPartitionedFile(
   file: FileStatusWithMetadata,
-  partitionValues: InternalRow): PartitionedFile = {
-val hosts = getBlockHosts(getBlockLocations(file.fileStatus), 0, file.getLen)
-PartitionedFile(partitionValues, SparkPath.fromPath(file.getPath), 0, file.getLen, hosts,
+  partitionValues: InternalRow,
+  start: Long,
+  length: Long): PartitionedFile = {
+val hosts = getBlockHosts(getBlockLocations(file.fileStatus), start, length)
+PartitionedFile(partitionValues, SparkPath.fromPath(file.getPath), start, length, hosts,
   file.getModificationTime, file.getLen, file.metadata)
   }
 





(spark) branch master updated: [SPARK-46882][SS][TEST] Replace unnecessary AtomicInteger with int

2024-01-31 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c63dea8f4235 [SPARK-46882][SS][TEST] Replace unnecessary AtomicInteger 
with int
c63dea8f4235 is described below

commit c63dea8f42357ecfd4fe41f04732e2cb0d0d53ae
Author: beliefer 
AuthorDate: Wed Jan 31 17:47:50 2024 -0800

[SPARK-46882][SS][TEST] Replace unnecessary AtomicInteger with int

### What changes were proposed in this pull request?
This PR proposes replacing an unnecessary `AtomicInteger` with a plain int.

### Why are the changes needed?
The variable `value` of `GetMaxCounter` is only ever accessed inside the instance's `synchronized` blocks, so we can replace the unnecessary `AtomicInteger` with a plain int.
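
A minimal standalone sketch of the idea (simplified name, not the exact suite class): when every access goes through the same `synchronized` lock, a plain `Int` is already safe and the atomic wrapper adds nothing.

```
// All reads and writes happen inside synchronized blocks on `this`,
// so a plain Int suffices; no AtomicInteger or @volatile is needed.
class MaxCounter {
  private var value = 0
  private var max = 0
  def increment(): Unit = synchronized {
    value += 1
    if (value > max) max = value
  }
  def decrement(): Unit = synchronized { value -= 1 }
  def get(): Int = synchronized { value }
  def getMax(): Int = synchronized { max }
}
```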

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
GA.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #44907 from beliefer/SPARK-46882.

Authored-by: beliefer 
Signed-off-by: Dongjoon Hyun 
---
 .../apache/spark/streaming/util/WriteAheadLogSuite.scala| 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/streaming/src/test/scala/org/apache/spark/streaming/util/WriteAheadLogSuite.scala b/streaming/src/test/scala/org/apache/spark/streaming/util/WriteAheadLogSuite.scala
index 3a9fffec13cf..cf9d5b7387f7 100644
--- a/streaming/src/test/scala/org/apache/spark/streaming/util/WriteAheadLogSuite.scala
+++ b/streaming/src/test/scala/org/apache/spark/streaming/util/WriteAheadLogSuite.scala
@@ -20,7 +20,6 @@ import java.io._
 import java.nio.ByteBuffer
 import java.util.{Iterator => JIterator}
 import java.util.concurrent.{CountDownLatch, RejectedExecutionException, ThreadPoolExecutor, TimeUnit}
-import java.util.concurrent.atomic.AtomicInteger
 
 import scala.collection.mutable.ArrayBuffer
 import scala.concurrent._
@@ -238,14 +237,14 @@ class FileBasedWriteAheadLogSuite
 val executionContext = ExecutionContext.fromExecutorService(fpool)
 
 class GetMaxCounter {
-  private val value = new AtomicInteger()
-  @volatile private var max: Int = 0
+  private var value = 0
+  private var max: Int = 0
   def increment(): Unit = synchronized {
-val atInstant = value.incrementAndGet()
-if (atInstant > max) max = atInstant
+value = value + 1
+if (value > max) max = value
   }
-  def decrement(): Unit = synchronized { value.decrementAndGet() }
-  def get(): Int = synchronized { value.get() }
+  def decrement(): Unit = synchronized { value = value - 1 }
+  def get(): Int = synchronized { value }
   def getMax(): Int = synchronized { max }
 }
 try {





(spark) branch master updated: [SPARK-46931][PS] Implement `{Frame, Series}.to_hdf`

2024-01-31 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 88f121c47778 [SPARK-46931][PS] Implement `{Frame, Series}.to_hdf`
88f121c47778 is described below

commit 88f121c47778f0755862046d09484a83932cb30b
Author: Ruifeng Zheng 
AuthorDate: Wed Jan 31 08:41:21 2024 -0800

[SPARK-46931][PS] Implement `{Frame, Series}.to_hdf`

### What changes were proposed in this pull request?
Implement `{Frame, Series}.to_hdf`

### Why are the changes needed?
pandas parity

### Does this PR introduce _any_ user-facing change?
yes
```
In [3]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])

In [4]: df.to_hdf('/tmp/data.h5', key='df', mode='w')

In [5]: psdf = ps.from_pandas(df)

In [6]: psdf.to_hdf('/tmp/data2.h5', key='df', mode='w')
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:1015: PandasAPIOnSparkAdviceWarning: `to_hdf` loads all data into the driver's memory. It should only be used if the resulting DataFrame is expected to be small.
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)

In [7]: !ls /tmp/*h5
/tmp/data.h5  /tmp/data2.h5

In [8]: !ls -lh /tmp/*h5
-rw-r--r-- 1 ruifeng.zheng  wheel   6.9K Jan 31 12:21 /tmp/data.h5
-rw-r--r-- 1 ruifeng.zheng  wheel   6.9K Jan 31 12:21 /tmp/data2.h5
```

### How was this patch tested?
Manual test; `hdf` support requires the additional library `pytables`, which in turn needs [many prerequisites](https://www.pytables.org/usersguide/installation.html#prerequisites).

Since `pytables` is just an optional dependency of pandas, I think we can avoid adding it to CI for now.

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44966 from zhengruifeng/ps_to_hdf.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 .../docs/source/reference/pyspark.pandas/frame.rst |   1 +
 .../source/reference/pyspark.pandas/series.rst |   1 +
 python/pyspark/pandas/generic.py   | 120 +
 python/pyspark/pandas/missing/frame.py |   1 -
 python/pyspark/pandas/missing/series.py|   1 -
 5 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/python/docs/source/reference/pyspark.pandas/frame.rst b/python/docs/source/reference/pyspark.pandas/frame.rst
index 12cf6e7db12f..77b60468b8fb 100644
--- a/python/docs/source/reference/pyspark.pandas/frame.rst
+++ b/python/docs/source/reference/pyspark.pandas/frame.rst
@@ -286,6 +286,7 @@ Serialization / IO / Conversion
DataFrame.to_json
DataFrame.to_dict
DataFrame.to_excel
+   DataFrame.to_hdf
DataFrame.to_clipboard
DataFrame.to_markdown
DataFrame.to_records
diff --git a/python/docs/source/reference/pyspark.pandas/series.rst b/python/docs/source/reference/pyspark.pandas/series.rst
index 88d1861c6ccf..5606fa93a5f3 100644
--- a/python/docs/source/reference/pyspark.pandas/series.rst
+++ b/python/docs/source/reference/pyspark.pandas/series.rst
@@ -486,6 +486,7 @@ Serialization / IO / Conversion
Series.to_json
Series.to_csv
Series.to_excel
+   Series.to_hdf
Series.to_frame
 
 Pandas-on-Spark specific
diff --git a/python/pyspark/pandas/generic.py b/python/pyspark/pandas/generic.py
index 77cefb53fe5d..ed2aeb8ea6af 100644
--- a/python/pyspark/pandas/generic.py
+++ b/python/pyspark/pandas/generic.py
@@ -1103,6 +1103,126 @@ class Frame(object, metaclass=ABCMeta):
 psdf._to_internal_pandas(), self.to_excel, f, args
 )
 
+def to_hdf(
+self,
+path_or_buf: Union[str, pd.HDFStore],
+key: str,
+mode: str = "a",
+complevel: Optional[int] = None,
+complib: Optional[str] = None,
+append: bool = False,
+format: Optional[str] = None,
+index: bool = True,
+min_itemsize: Optional[Union[int, Dict[str, int]]] = None,
+nan_rep: Optional[Any] = None,
+dropna: Optional[bool] = None,
+data_columns: Optional[Union[bool, List[str]]] = None,
+errors: str = "strict",
+encoding: str = "UTF-8",
+) -> None:
+"""
+Write the contained data to an HDF5 file using HDFStore.
+
+.. note:: This method should only be used if the resulting DataFrame is expected
+  to be small, as all the data is loaded into the driver's memory.
+
+.. versionadded:: 4.0.0
+
+Parameters
+----------
+path_or_buf : str or pandas.HDFStore
+File path or HDFStore object.
+key : str
+Identifier for the group in the store.
+mode : {'a', 'w', 'r+'}, default 'a'
+Mode to open file:
+
+- 'w': write, a new file is 

(spark) branch master updated: [SPARK-46930][SQL] Add support for a custom prefix for Union type fields in Avro

2024-01-31 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d49265a170fb [SPARK-46930][SQL] Add support for a custom prefix for 
Union type fields in Avro
d49265a170fb is described below

commit d49265a170fb7bb06471d97f4483139529939ecd
Author: Ivan Sadikov 
AuthorDate: Wed Jan 31 08:39:46 2024 -0800

[SPARK-46930][SQL] Add support for a custom prefix for Union type fields in 
Avro

### What changes were proposed in this pull request?

This PR enhances stable ids functionality in Avro by allowing users to 
configure a custom prefix for Union type member fields when 
`enableStableIdentifiersForUnionType` is enabled.

Without the patch, the fields are generated with the `member_` prefix, e.g. `member_int`, `member_string`. This could become difficult to change for complex schemas.

The solution is to add a new option `stableIdentifierPrefixForUnionType` 
which defaults to `member_` and allows users to configure whatever prefix they 
require, e.g. `member`, `tmp_`, or even an empty string.
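
A hypothetical usage sketch (the path and the `u_` prefix are placeholders; assumes an existing `SparkSession` named `spark` and the Avro data source on the classpath):

```
// Read Avro with stable identifiers for union members, overriding the
// default "member_" prefix via the new option added in this PR.
val df = spark.read
  .format("avro")
  .option("enableStableIdentifiersForUnionType", "true")
  .option("stableIdentifierPrefixForUnionType", "u_")
  .load("/path/to/union_records.avro")
```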

### Why are the changes needed?

Allows customizing the prefix of stable ids in Avro without the need to rename all of the columns, which could be cumbersome for complex schemas.

### Does this PR introduce _any_ user-facing change?

Yes. The PR adds a new option in Avro: `stableIdentifierPrefixForUnionType`.

### How was this patch tested?

Existing tests + a new unit test to verify different prefixes.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44964 from sadikovi/SPARK-46930.

Authored-by: Ivan Sadikov 
Signed-off-by: Dongjoon Hyun 
---
 .../apache/spark/sql/avro/AvroDataToCatalyst.scala | 12 +++--
 .../apache/spark/sql/avro/AvroDeserializer.scala   | 12 +++--
 .../org/apache/spark/sql/avro/AvroFileFormat.scala |  3 +-
 .../org/apache/spark/sql/avro/AvroOptions.scala|  6 +++
 .../org/apache/spark/sql/avro/AvroUtils.scala  |  5 +-
 .../apache/spark/sql/avro/SchemaConverters.scala   | 58 +++---
 .../sql/v2/avro/AvroPartitionReaderFactory.scala   |  3 +-
 .../sql/avro/AvroCatalystDataConversionSuite.scala |  7 +--
 .../apache/spark/sql/avro/AvroFunctionsSuite.scala |  3 +-
 .../apache/spark/sql/avro/AvroRowReaderSuite.scala |  3 +-
 .../org/apache/spark/sql/avro/AvroSerdeSuite.scala |  3 +-
 .../org/apache/spark/sql/avro/AvroSuite.scala  | 54 +++-
 docs/sql-data-sources-avro.md  | 10 +++-
 13 files changed, 133 insertions(+), 46 deletions(-)

diff --git a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala
index 9f31a2db55a5..7d80998d96eb 100644
--- a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala
+++ b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala
@@ -40,7 +40,9 @@ private[sql] case class AvroDataToCatalyst(
 
   override lazy val dataType: DataType = {
 val dt = SchemaConverters.toSqlType(
-  expectedSchema, avroOptions.useStableIdForUnionType).dataType
+  expectedSchema,
+  avroOptions.useStableIdForUnionType,
+  avroOptions.stableIdPrefixForUnionType).dataType
 parseMode match {
   // With PermissiveMode, the output Catalyst row might contain columns of null values for
   // corrupt records, even if some of the columns are not nullable in the user-provided schema.
@@ -62,8 +64,12 @@ private[sql] case class AvroDataToCatalyst(
  @transient private lazy val reader = new GenericDatumReader[Any](actualSchema, expectedSchema)
 
   @transient private lazy val deserializer =
-new AvroDeserializer(expectedSchema, dataType,
-  avroOptions.datetimeRebaseModeInRead, avroOptions.useStableIdForUnionType)
+new AvroDeserializer(
+  expectedSchema,
+  dataType,
+  avroOptions.datetimeRebaseModeInRead,
+  avroOptions.useStableIdForUnionType,
+  avroOptions.stableIdPrefixForUnionType)
 
   @transient private var decoder: BinaryDecoder = _
 
diff --git a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
index 9e10fac8bb55..139c45adb442 100644
--- a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
+++ b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
@@ -50,20 +50,23 @@ private[sql] class AvroDeserializer(
 positionalFieldMatch: Boolean,
 datetimeRebaseSpec: RebaseSpec,
 filters: StructFilters,
-useStableIdForUnionType: Boolean) {
+useStableIdForUnionType: Boolean,
+

(spark) branch master updated: [SPARK-46929][CORE][CONNECT][SS] Use ThreadUtils.shutdown to close thread pools

2024-01-31 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 262ed5bcab0b [SPARK-46929][CORE][CONNECT][SS] Use ThreadUtils.shutdown 
to close thread pools
262ed5bcab0b is described below

commit 262ed5bcab0ba750b089b0693dbb1a59ef6fd11f
Author: beliefer 
AuthorDate: Wed Jan 31 09:52:19 2024 -0600

[SPARK-46929][CORE][CONNECT][SS] Use ThreadUtils.shutdown to close thread 
pools

### What changes were proposed in this pull request?
This PR proposes using `ThreadUtils.shutdown` to close thread pools.

### Why are the changes needed?
`ThreadUtils` provides a `shutdown` method that wraps the common logic for shutting down thread pools.
We should use `ThreadUtils.shutdown` to close thread pools.
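
The pattern applied across the touched files, as a small sketch (the one-minute grace period mirrors the diff below; `ThreadUtils` is a Spark-internal utility, so this only applies inside Spark's own code base):

```
import java.util.concurrent.{Executors, TimeUnit}

import scala.concurrent.duration.FiniteDuration

import org.apache.spark.util.ThreadUtils

val executor = Executors.newSingleThreadScheduledExecutor()
// Before: executor.shutdown(); executor.awaitTermination(1, TimeUnit.MINUTES)
// After: one call that wraps the shutdown-and-await logic.
ThreadUtils.shutdown(executor, FiniteDuration(1, TimeUnit.MINUTES))
```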

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
GA

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #44962 from beliefer/SPARK-46929.

Authored-by: beliefer 
Signed-off-by: Sean Owen 
---
 .../sql/connect/service/SparkConnectExecutionManager.scala  |  5 +++--
 .../sql/connect/service/SparkConnectSessionManager.scala|  5 +++--
 .../connect/service/SparkConnectStreamingQueryCache.scala   |  9 +++--
 .../scala/org/apache/spark/ExecutorAllocationManager.scala  |  4 ++--
 .../org/apache/spark/status/ElementTrackingStore.scala  |  6 ++
 .../main/scala/org/apache/spark/streaming/Checkpoint.scala  | 12 +---
 .../org/apache/spark/streaming/scheduler/JobScheduler.scala | 13 ++---
 7 files changed, 24 insertions(+), 30 deletions(-)

diff --git a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectExecutionManager.scala b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectExecutionManager.scala
index c90f53ac07df..85fb150b3171 100644
--- a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectExecutionManager.scala
+++ b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectExecutionManager.scala
@@ -21,6 +21,7 @@ import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}
 import javax.annotation.concurrent.GuardedBy
 
 import scala.collection.mutable
+import scala.concurrent.duration.FiniteDuration
 import scala.jdk.CollectionConverters._
 import scala.util.control.NonFatal
 
@@ -30,6 +31,7 @@ import org.apache.spark.{SparkEnv, SparkSQLException}
 import org.apache.spark.connect.proto
 import org.apache.spark.internal.Logging
 import org.apache.spark.sql.connect.config.Connect.{CONNECT_EXECUTE_MANAGER_ABANDONED_TOMBSTONES_SIZE, CONNECT_EXECUTE_MANAGER_DETACHED_TIMEOUT, CONNECT_EXECUTE_MANAGER_MAINTENANCE_INTERVAL}
+import org.apache.spark.util.ThreadUtils
 
 // Unique key identifying execution by combination of user, session and operation id
 case class ExecuteKey(userId: String, sessionId: String, operationId: String)
@@ -167,8 +169,7 @@ private[connect] class SparkConnectExecutionManager() extends Logging {
 
   private[connect] def shutdown(): Unit = executionsLock.synchronized {
 scheduledExecutor.foreach { executor =>
-  executor.shutdown()
-  executor.awaitTermination(1, TimeUnit.MINUTES)
+  ThreadUtils.shutdown(executor, FiniteDuration(1, TimeUnit.MINUTES))
 }
 scheduledExecutor = None
 // note: this does not cleanly shut down the executions, but the server is shutting down.
diff --git a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectSessionManager.scala b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectSessionManager.scala
index ef14cd305d40..4da728b95a33 100644
--- a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectSessionManager.scala
+++ b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectSessionManager.scala
@@ -22,6 +22,7 @@ import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}
 import javax.annotation.concurrent.GuardedBy
 
 import scala.collection.mutable
+import scala.concurrent.duration.FiniteDuration
 import scala.jdk.CollectionConverters._
 import scala.util.control.NonFatal
 
@@ -31,6 +32,7 @@ import org.apache.spark.{SparkEnv, SparkSQLException}
 import org.apache.spark.internal.Logging
 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.connect.config.Connect.{CONNECT_SESSION_MANAGER_CLOSED_SESSIONS_TOMBSTONES_SIZE, CONNECT_SESSION_MANAGER_DEFAULT_SESSION_TIMEOUT, CONNECT_SESSION_MANAGER_MAINTENANCE_INTERVAL}
+import org.apache.spark.util.ThreadUtils
 
 /**
  * Global tracker of all SessionHolders 

(spark) branch master updated: [SPARK-46400][CORE][SQL] When there are corrupted files in the local maven repo, skip this cache and try again

2024-01-31 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f2a471e9cc75 [SPARK-46400][CORE][SQL] When there are corrupted files 
in the local maven repo, skip this cache and try again
f2a471e9cc75 is described below

commit f2a471e9cc752f3826232eedc9025fd156a85965
Author: panbingkun 
AuthorDate: Wed Jan 31 09:46:07 2024 -0600

[SPARK-46400][CORE][SQL] When there are corrupted files in the local maven 
repo, skip this cache and try again

### What changes were proposed in this pull request?
The PR aims to
- fix a potential bug (i.e. https://github.com/apache/spark/pull/44208) and enhance the user experience.
- make the code more compliant with standards

### Why are the changes needed?
We use the local maven repo as the first-level cache in Ivy. The original intention was to reduce the time required to resolve and obtain artifacts, but when there are corrupted files in the local maven repo, this mechanism is interrupted outright and the resulting error is very unfriendly, which greatly confuses users. In keeping with the original intention, we should simply skip this cache in such situations.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44343 from panbingkun/SPARK-46400.

Authored-by: panbingkun 
Signed-off-by: Sean Owen 
---
 .../scala/org/apache/spark/util/MavenUtils.scala   | 147 +++--
 .../sql/hive/client/IsolatedClientLoader.scala |   4 +
 2 files changed, 112 insertions(+), 39 deletions(-)

diff --git a/common/utils/src/main/scala/org/apache/spark/util/MavenUtils.scala b/common/utils/src/main/scala/org/apache/spark/util/MavenUtils.scala
index 2d7fba6f07d5..65530b7fa473 100644
--- a/common/utils/src/main/scala/org/apache/spark/util/MavenUtils.scala
+++ b/common/utils/src/main/scala/org/apache/spark/util/MavenUtils.scala
@@ -27,7 +27,7 @@ import org.apache.ivy.Ivy
 import org.apache.ivy.core.LogOptions
 import org.apache.ivy.core.module.descriptor.{Artifact, DefaultDependencyDescriptor, DefaultExcludeRule, DefaultModuleDescriptor, ExcludeRule}
 import org.apache.ivy.core.module.id.{ArtifactId, ModuleId, ModuleRevisionId}
-import org.apache.ivy.core.report.ResolveReport
+import org.apache.ivy.core.report.{DownloadStatus, ResolveReport}
 import org.apache.ivy.core.resolve.ResolveOptions
 import org.apache.ivy.core.retrieve.RetrieveOptions
 import org.apache.ivy.core.settings.IvySettings
@@ -43,8 +43,8 @@ import org.apache.spark.util.ArrayImplicits._
 private[spark] object MavenUtils extends Logging {
   val JAR_IVY_SETTING_PATH_KEY: String = "spark.jars.ivySettings"
 
-//  // Exposed for testing
-//  var printStream = SparkSubmit.printStream
+  // Exposed for testing
+  // var printStream = SparkSubmit.printStream
 
   // Exposed for testing.
  // These components are used to make the default exclusion rules for Spark dependencies.
@@ -113,7 +113,7 @@ private[spark] object MavenUtils extends Logging {
 splits(2) != null && splits(2).trim.nonEmpty,
 s"The version cannot be null or " +
   s"be whitespace. The version provided is: ${splits(2)}")
-  new MavenCoordinate(splits(0), splits(1), splits(2))
+  MavenCoordinate(splits(0), splits(1), splits(2))
 }.toImmutableArraySeq
   }
 
@@ -128,24 +128,30 @@ private[spark] object MavenUtils extends Logging {
   }
 
   /**
-   * Extracts maven coordinates from a comma-delimited string
+   * Create a ChainResolver used by Ivy to search for and resolve dependencies.
*
* @param defaultIvyUserDir
*   The default user path for Ivy
+   * @param useLocalM2AsCache
+   *   Whether to use the local maven repo as a cache
* @return
*   A ChainResolver used by Ivy to search for and resolve dependencies.
*/
-  private[util] def createRepoResolvers(defaultIvyUserDir: File): ChainResolver = {
+  private[util] def createRepoResolvers(
+  defaultIvyUserDir: File,
+  useLocalM2AsCache: Boolean = true): ChainResolver = {
 // We need a chain resolver if we want to check multiple repositories
 val cr = new ChainResolver
 cr.setName("spark-list")
 
-val localM2 = new IBiblioResolver
-localM2.setM2compatible(true)
-localM2.setRoot(m2Path.toURI.toString)
-localM2.setUsepoms(true)
-localM2.setName("local-m2-cache")
-cr.add(localM2)
+if (useLocalM2AsCache) {
+  val localM2 = new IBiblioResolver
+  localM2.setM2compatible(true)
+  localM2.setRoot(m2Path.toURI.toString)
+  localM2.setUsepoms(true)
+  localM2.setName("local-m2-cache")
+  cr.add(localM2)
+}
 
 val localIvy = new FileSystemResolver
 val 

(spark) branch master updated: [SPARK-45522][BUILD][CORE][SQL][UI] Migrate from Jetty 9 to Jetty 10

2024-01-31 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6c19bf6b48e7 [SPARK-45522][BUILD][CORE][SQL][UI] Migrate from Jetty 9 
to Jetty 10
6c19bf6b48e7 is described below

commit 6c19bf6b48e7e2ab9937dc2d91ea23dd83abae64
Author: HiuFung Kwok 
AuthorDate: Wed Jan 31 09:42:16 2024 -0600

[SPARK-45522][BUILD][CORE][SQL][UI] Migrate from Jetty 9 to Jetty 10

### What changes were proposed in this pull request?

This is an upgrade ticket to bump the Jetty version from 9 to 10.
This PR aims to bring incremental Jetty upgrades to Spark, as Jetty 9 
support already reached EOL.

### Why are the changes needed?

Jetty 9 is already beyond EOL, which means Spark will no longer receive any security fixes for it.

### Does this PR introduce _any_ user-facing change?

No. The SNI host check now defaults to true on embedded Jetty, so it is set back to false to maintain backward compatibility.
The redirect behaviour for a trailing `/` has changed, but modern browsers pick up the 302 status code and redirect accordingly, so there is no user-level impact.
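
A hedged sketch of how the old behaviour can be kept under Jetty 10 (not necessarily the PR's exact code; `SecureRequestCustomizer.setSniHostCheck` is the standard Jetty knob for this):

```
import org.eclipse.jetty.server.{HttpConfiguration, SecureRequestCustomizer}

// Jetty 10 enables the SNI host check by default; turning it off restores
// the Jetty 9 behaviour for the HTTPS connector configuration.
val httpsConfig = new HttpConfiguration()
val customizer = new SecureRequestCustomizer()
customizer.setSniHostCheck(false)
httpsConfig.addCustomizer(customizer)
```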

### How was this patch tested?

Junit test case.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43765 from HiuKwok/ft-hf-SPARK-45522-jetty-upgradte.

Lead-authored-by: HiuFung Kwok 
Co-authored-by: HiuFung Kwok <37996731+hiuk...@users.noreply.github.com>
Signed-off-by: Sean Owen 
---
 LICENSE-binary |  1 -
 core/pom.xml   |  8 +---
 .../main/scala/org/apache/spark/SSLOptions.scala   |  2 +-
 .../main/scala/org/apache/spark/TestUtils.scala| 13 +
 .../scala/org/apache/spark/ui/JettyUtils.scala | 13 ++---
 .../test/scala/org/apache/spark/ui/UISuite.scala   | 22 +-
 dev/deps/spark-deps-hadoop-3-hive-2.3  |  4 ++--
 dev/test-dependencies.sh   |  2 +-
 pom.xml|  8 +---
 .../service/cli/thrift/ThriftHttpCLIService.java   | 12 ++--
 10 files changed, 52 insertions(+), 33 deletions(-)

diff --git a/LICENSE-binary b/LICENSE-binary
index c6f291f11088..2073d85246b6 100644
--- a/LICENSE-binary
+++ b/LICENSE-binary
@@ -368,7 +368,6 @@ xerces:xercesImpl
 org.codehaus.jackson:jackson-jaxrs
 org.codehaus.jackson:jackson-xc
 org.eclipse.jetty:jetty-client
-org.eclipse.jetty:jetty-continuation
 org.eclipse.jetty:jetty-http
 org.eclipse.jetty:jetty-io
 org.eclipse.jetty:jetty-jndi
diff --git a/core/pom.xml b/core/pom.xml
index c093213bd6b9..f780551fb555 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -146,11 +146,6 @@
   jetty-http
   compile
 
-
-  org.eclipse.jetty
-  jetty-continuation
-  compile
-
 
   org.eclipse.jetty
   jetty-servlet
@@ -538,7 +533,7 @@
   true
   true
   
-guava,protobuf-java,jetty-io,jetty-servlet,jetty-servlets,jetty-continuation,jetty-http,jetty-plus,jetty-util,jetty-server,jetty-security,jetty-proxy,jetty-client
+guava,protobuf-java,jetty-io,jetty-servlet,jetty-servlets,jetty-http,jetty-plus,jetty-util,jetty-server,jetty-security,jetty-proxy,jetty-client
   
   true
 
@@ -558,7 +553,6 @@
   org.eclipse.jetty:jetty-http
   org.eclipse.jetty:jetty-proxy
   org.eclipse.jetty:jetty-client
-  org.eclipse.jetty:jetty-continuation
   org.eclipse.jetty:jetty-servlet
   org.eclipse.jetty:jetty-servlets
   org.eclipse.jetty:jetty-plus
diff --git a/core/src/main/scala/org/apache/spark/SSLOptions.scala b/core/src/main/scala/org/apache/spark/SSLOptions.scala
index 26108d885e4c..ce058cec2686 100644
--- a/core/src/main/scala/org/apache/spark/SSLOptions.scala
+++ b/core/src/main/scala/org/apache/spark/SSLOptions.scala
@@ -87,7 +87,7 @@ private[spark] case class SSLOptions(
   /**
 * Creates a Jetty SSL context factory according to the SSL settings represented by this object.
*/
-  def createJettySslContextFactory(): Option[SslContextFactory] = {
+  def createJettySslContextFactoryServer(): Option[SslContextFactory.Server] = {
 if (enabled) {
   val sslContextFactory = new SslContextFactory.Server()
 
diff --git a/core/src/main/scala/org/apache/spark/TestUtils.scala b/core/src/main/scala/org/apache/spark/TestUtils.scala
index e85f98ff55c5..5e3078d7292b 100644
--- a/core/src/main/scala/org/apache/spark/TestUtils.scala
+++ b/core/src/main/scala/org/apache/spark/TestUtils.scala
@@ -252,6 +252,19 @@ private[spark] object TestUtils extends 

(spark) branch master updated: [MINOR][SQL] Use `DecimalType.MINIMUM_ADJUSTED_SCALE` instead of magic number `6` in `Divide` class

2024-01-31 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0c7770f4de56 [MINOR][SQL] Use `DecimalType.MINIMUM_ADJUSTED_SCALE` 
instead of magic number `6` in `Divide` class
0c7770f4de56 is described below

commit 0c7770f4de560ad74e93b0902ab7a6be52c655be
Author: longfei.jiang <1251489...@qq.com>
AuthorDate: Wed Jan 31 09:40:07 2024 -0600

[MINOR][SQL] Use `DecimalType.MINIMUM_ADJUSTED_SCALE` instead of magic 
number `6` in `Divide` class

### What changes were proposed in this pull request?

Replace magic value `6` with constants `DecimalType.MINIMUM_ADJUSTED_SCALE`

### Why are the changes needed?

Magic values are less self-documenting than constant values.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UT `ArithmeticExpressionSuite#"SPARK-45786: Decimal multiply, 
divide, remainder, quot"` can provide testing

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44941 from jlfsdtc/magic_value.

Authored-by: longfei.jiang <1251489...@qq.com>
Signed-off-by: Sean Owen 
---
 .../scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala| 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
index a0fb17cec812..9f1b42ad84d3 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
@@ -810,7 +810,7 @@ case class Divide(
   DecimalType.adjustPrecisionScale(prec, scale)
 } else {
   var intDig = min(DecimalType.MAX_SCALE, p1 - s1 + s2)
-  var decDig = min(DecimalType.MAX_SCALE, max(6, s1 + p2 + 1))
+  var decDig = min(DecimalType.MAX_SCALE, max(DecimalType.MINIMUM_ADJUSTED_SCALE, s1 + p2 + 1))
   val diff = (intDig + decDig) - DecimalType.MAX_SCALE
   if (diff > 0) {
 decDig -= diff / 2 + 1





(spark) branch master updated (0871a6f15623 -> e95e820ac137)

2024-01-31 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 0871a6f15623 [SPARK-46753][PYTHON][TESTS] Fix pypy3 python test
 add e95e820ac137 [SPARK-46932] Clean up the imports in 
`pyspark.pandas.test_*`

No new revisions were added by this update.

Summary of changes:
 .../pandas/tests/connect/test_parity_categorical.py | 10 +-
 .../pandas/tests/connect/test_parity_extension.py   | 21 ++---
 .../tests/connect/test_parity_numpy_compat.py   | 20 ++--
 python/pyspark/pandas/tests/test_categorical.py | 12 ++--
 python/pyspark/pandas/tests/test_extension.py   | 11 +--
 python/pyspark/pandas/tests/test_numpy_compat.py| 12 ++--
 6 files changed, 46 insertions(+), 40 deletions(-)





(spark) branch master updated (5d87ac6214bb -> 0871a6f15623)

2024-01-31 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 5d87ac6214bb [SPARK-46926][PS] Add `convert_dtypes`, `infer_objects` 
and `set_axis` in fallback list
 add 0871a6f15623 [SPARK-46753][PYTHON][TESTS] Fix pypy3 python test

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/tests/test_session.py  |  6 ++
 python/pyspark/sql/tests/test_udf.py  | 10 ++-
 python/pyspark/sql/tests/test_udf_profiler.py |  1 +
 python/pyspark/sql/tests/test_utils.py| 35 +++---
 python/pyspark/testing/connectutils.py| 99 ++-
 python/pyspark/testing/utils.py   |  9 +--
 6 files changed, 96 insertions(+), 64 deletions(-)





(spark) branch master updated (8f406212aa09 -> 5d87ac6214bb)

2024-01-31 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 8f406212aa09 [SPARK-46924][CORE] Fix `Load New` button in 
`Master/HistoryServer` Log UI
 add 5d87ac6214bb [SPARK-46926][PS] Add `convert_dtypes`, `infer_objects` 
and `set_axis` in fallback list

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/frame.py | 8 +++-
 python/pyspark/pandas/missing/frame.py | 4 +++-
 2 files changed, 10 insertions(+), 2 deletions(-)





(spark) branch master updated (2f917c32feeb -> 8f406212aa09)

2024-01-31 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 2f917c32feeb [SPARK-46927][PYTHON] Make `assertDataFrameEqual` work 
properly without PyArrow
 add 8f406212aa09 [SPARK-46924][CORE] Fix `Load New` button in 
`Master/HistoryServer` Log UI

No new revisions were added by this update.

Summary of changes:
 .../deploy/{history/LogPage.scala => Utils.scala}  | 79 +++---
 .../spark/deploy/history/HistoryServer.scala   |  3 +-
 .../org/apache/spark/deploy/history/LogPage.scala  | 55 ++-
 .../apache/spark/deploy/master/ui/LogPage.scala| 55 ++-
 .../spark/deploy/master/ui/MasterWebUI.scala   |  2 +
 5 files changed, 37 insertions(+), 157 deletions(-)
 copy core/src/main/scala/org/apache/spark/deploy/{history/LogPage.scala => Utils.scala} (57%)





(spark) branch branch-3.4 updated: [SPARK-46747][SQL] Avoid scan in getTableExistsQuery for JDBC Dialects

2024-01-31 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 64115d9829ec [SPARK-46747][SQL] Avoid scan in getTableExistsQuery for 
JDBC Dialects
64115d9829ec is described below

commit 64115d9829ec881b69ffe44f844845104484a025
Author: Kent Yao 
AuthorDate: Tue Jan 30 20:34:28 2024 +0800

[SPARK-46747][SQL] Avoid scan in getTableExistsQuery for JDBC Dialects

### What changes were proposed in this pull request?

[SPARK-46747](https://issues.apache.org/jira/browse/SPARK-46747) reported an issue where Postgres instances suffered from too many shared locks, caused by Spark's table-existence check. In this PR, we replaced `"SELECT 1 FROM $table LIMIT 1"` with `"SELECT 1 FROM $table WHERE 1=0"` to prevent data from being scanned.
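
The new default probe, shown as a sketch (taken from `JdbcDialect` in the diff below): an always-false predicate means the existence check reads and locks no rows, unlike `LIMIT 1`.

```
// Default table-existence query after this change.
def getTableExistsQuery(table: String): String = {
  s"SELECT 1 FROM $table WHERE 1=0"
}
```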

### Why are the changes needed?

overhead reduction for JDBC datasources

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing JDBC v1/v2 datasource tests.

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #44948 from yaooqinn/SPARK-46747.

Authored-by: Kent Yao 
Signed-off-by: Kent Yao 
(cherry picked from commit 031df8fa62666f14f54cf0a792f7fa2acc43afee)
Signed-off-by: Kent Yao 
---
 .../src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala| 2 +-
 .../src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala| 4 
 .../src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala | 4 
 sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala  | 7 +++
 4 files changed, 4 insertions(+), 13 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
index c58d8368150b..ab5bfdb7189a 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
@@ -142,7 +142,7 @@ abstract class JdbcDialect extends Serializable with Logging {
* @return The SQL query to use for checking the table.
*/
   def getTableExistsQuery(table: String): String = {
-s"SELECT * FROM $table WHERE 1=0"
+s"SELECT 1 FROM $table WHERE 1=0"
   }
 
   /**
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala
index 12882dc8e676..29eb8916bb79 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala
@@ -107,10 +107,6 @@ private case object MySQLDialect extends JdbcDialect with SQLConfHelper {
 schemaBuilder.result
   }
 
-  override def getTableExistsQuery(table: String): String = {
-s"SELECT 1 FROM $table LIMIT 1"
-  }
-
   override def isCascadingTruncateTable(): Option[Boolean] = Some(false)
 
   // See https://dev.mysql.com/doc/refman/8.0/en/alter-table.html
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala
index c2ca45d9143a..7f6f8dc00886 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala
@@ -113,10 +113,6 @@ private object PostgresDialect extends JdbcDialect with SQLConfHelper {
 case _ => None
   }
 
-  override def getTableExistsQuery(table: String): String = {
-s"SELECT 1 FROM $table LIMIT 1"
-  }
-
   override def isCascadingTruncateTable(): Option[Boolean] = Some(false)
 
   /**
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
index aa66fcd53041..7c9306b65f1e 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
@@ -1009,10 +1009,9 @@ class JDBCSuite extends QueryTest with SharedSparkSession {
 val h2 = JdbcDialects.get(url)
 val derby = JdbcDialects.get("jdbc:derby:db")
 val table = "weblogs"
-val defaultQuery = s"SELECT * FROM $table WHERE 1=0"
-val limitQuery = s"SELECT 1 FROM $table LIMIT 1"
-assert(MySQL.getTableExistsQuery(table) == limitQuery)
-assert(Postgres.getTableExistsQuery(table) == limitQuery)
+val defaultQuery = s"SELECT 1 FROM $table WHERE 1=0"
+assert(MySQL.getTableExistsQuery(table) == defaultQuery)
+assert(Postgres.getTableExistsQuery(table) == defaultQuery)
 assert(db2.getTableExistsQuery(table) == defaultQuery)
 assert(h2.getTableExistsQuery(table) == defaultQuery)
 

(spark) branch branch-3.5 updated (343ae8226161 -> d3b4537f8bd5)

2024-01-31 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a change to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


from 343ae8226161 [SPARK-46893][UI] Remove inline scripts from UI 
descriptions
 add d3b4537f8bd5 [SPARK-46747][SQL] Avoid scan in getTableExistsQuery for 
JDBC Dialects

No new revisions were added by this update.

Summary of changes:
 .../src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala| 2 +-
 .../src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala| 4 
 .../src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala | 4 
 sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala  | 7 +++
 4 files changed, 4 insertions(+), 13 deletions(-)





(spark) branch branch-3.3 updated: [SPARK-46747][SQL] Avoid scan in getTableExistsQuery for JDBC Dialects

2024-01-31 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new e98872fb5d07 [SPARK-46747][SQL] Avoid scan in getTableExistsQuery for 
JDBC Dialects
e98872fb5d07 is described below

commit e98872fb5d07d570e6d0516b49a5d2e58876d1a6
Author: Kent Yao 
AuthorDate: Tue Jan 30 20:34:28 2024 +0800

[SPARK-46747][SQL] Avoid scan in getTableExistsQuery for JDBC Dialects

### What changes were proposed in this pull request?

[SPARK-46747](https://issues.apache.org/jira/browse/SPARK-46747) reported an issue where Postgres instances suffered from too many shared locks, caused by Spark's table-existence check. In this PR, we replaced `"SELECT 1 FROM $table LIMIT 1"` with `"SELECT 1 FROM $table WHERE 1=0"` to prevent data from being scanned.

### Why are the changes needed?

overhead reduction for JDBC datasources

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing JDBC v1/v2 datasource tests.

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #44948 from yaooqinn/SPARK-46747.

Authored-by: Kent Yao 
Signed-off-by: Kent Yao 
(cherry picked from commit 031df8fa62666f14f54cf0a792f7fa2acc43afee)
Signed-off-by: Kent Yao 
---
 .../src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala| 2 +-
 .../src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala| 4 
 .../src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala | 4 
 sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala  | 7 +++
 4 files changed, 4 insertions(+), 13 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
index 1e65542946af..0cd46dda62c3 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
@@ -141,7 +141,7 @@ abstract class JdbcDialect extends Serializable with Logging{
* @return The SQL query to use for checking the table.
*/
   def getTableExistsQuery(table: String): String = {
-s"SELECT * FROM $table WHERE 1=0"
+s"SELECT 1 FROM $table WHERE 1=0"
   }
 
   /**
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala
index 24f9bac74f86..73903d65b01a 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala
@@ -94,10 +94,6 @@ private case object MySQLDialect extends JdbcDialect with SQLConfHelper {
 schemaBuilder.result
   }
 
-  override def getTableExistsQuery(table: String): String = {
-s"SELECT 1 FROM $table LIMIT 1"
-  }
-
   override def isCascadingTruncateTable(): Option[Boolean] = Some(false)
 
   // See https://dev.mysql.com/doc/refman/8.0/en/alter-table.html
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala
index a668d66ee2f9..6de4d6ec71a8 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala
@@ -141,10 +141,6 @@ private object PostgresDialect extends JdbcDialect with SQLConfHelper {
 case _ => None
   }
 
-  override def getTableExistsQuery(table: String): String = {
-s"SELECT 1 FROM $table LIMIT 1"
-  }
-
   override def isCascadingTruncateTable(): Option[Boolean] = Some(false)
 
   /**
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
index a222391c06fb..74b85f18ceef 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
@@ -989,10 +989,9 @@ class JDBCSuite extends QueryTest
 val h2 = JdbcDialects.get(url)
 val derby = JdbcDialects.get("jdbc:derby:db")
 val table = "weblogs"
-val defaultQuery = s"SELECT * FROM $table WHERE 1=0"
-val limitQuery = s"SELECT 1 FROM $table LIMIT 1"
-assert(MySQL.getTableExistsQuery(table) == limitQuery)
-assert(Postgres.getTableExistsQuery(table) == limitQuery)
+val defaultQuery = s"SELECT 1 FROM $table WHERE 1=0"
+assert(MySQL.getTableExistsQuery(table) == defaultQuery)
+assert(Postgres.getTableExistsQuery(table) == defaultQuery)
 assert(db2.getTableExistsQuery(table) == defaultQuery)
 assert(h2.getTableExistsQuery(table) == defaultQuery)
 assert(derby.getTableExistsQuery(table) 

(spark) branch master updated (4f0a9df534a4 -> 2f917c32feeb)

2024-01-31 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 4f0a9df534a4 [SPARK-46925][PYTHON][CONNECT] Add a warning that 
instructs to install memory_profiler for memory profiling
 add 2f917c32feeb [SPARK-46927][PYTHON] Make `assertDataFrameEqual` work 
properly without PyArrow

No new revisions were added by this update.

Summary of changes:
 python/pyspark/testing/utils.py | 42 -
 1 file changed, 21 insertions(+), 21 deletions(-)





(spark) branch master updated (ecdb38e2d059 -> 4f0a9df534a4)

2024-01-31 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from ecdb38e2d059 [SPARK-46923][DOCS] Limit width of configuration tables
 add 4f0a9df534a4 [SPARK-46925][PYTHON][CONNECT] Add a warning that 
instructs to install memory_profiler for memory profiling

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/session.py | 17 -
 python/pyspark/sql/session.py | 15 ++-
 2 files changed, 30 insertions(+), 2 deletions(-)





(spark) branch master updated (8e29c0d8c5cd -> ecdb38e2d059)

2024-01-31 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 8e29c0d8c5cd [SPARK-46921][BUILD] Move `ProblemFilters` that do not 
belong to `defaultExcludes` to `v40excludes`
 add ecdb38e2d059 [SPARK-46923][DOCS] Limit width of configuration tables

No new revisions were added by this update.

Summary of changes:
 connector/profiler/README.md   |  2 +-
 docs/configuration.md  | 38 -
 docs/css/custom.css| 56 --
 docs/monitoring.md |  2 +-
 docs/running-on-kubernetes.md  |  2 +-
 docs/running-on-yarn.md|  6 +--
 docs/security.md   | 18 -
 docs/spark-standalone.md   | 20 +
 docs/sql-data-sources-avro.md  |  8 ++--
 docs/sql-data-sources-hive-tables.md   |  2 +-
 docs/sql-data-sources-orc.md   |  4 +-
 docs/sql-data-sources-parquet.md   |  2 +-
 docs/sql-performance-tuning.md | 16 
 docs/sql-ref-ansi-compliance.md| 38 +++--
 docs/structured-streaming-kafka-integration.md |  8 ++--
 sql/gen-sql-config-docs.py |  4 +-
 16 files changed, 143 insertions(+), 83 deletions(-)

