[spark] branch master updated (7c51618cc26 -> b00210bd032)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 7c51618cc26 [SPARK-43881][SQL][PYTHON][CONNECT] Add optional pattern for Catalog.listDatabases add b00210bd032 [SPARK-43960][PS][TESTS] DataFrameConversionTestsMixin is not tested properly No new revisions were added by this update. Summary of changes: python/pyspark/pandas/tests/test_dataframe_conversion.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-43881][SQL][PYTHON][CONNECT] Add optional pattern for Catalog.listDatabases
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7c51618cc26 [SPARK-43881][SQL][PYTHON][CONNECT] Add optional pattern for Catalog.listDatabases 7c51618cc26 is described below commit 7c51618cc2627fee8e6c2983319dc5ab6060d33f Author: Jiaan Geng AuthorDate: Mon Jun 5 08:32:28 2023 +0900 [SPARK-43881][SQL][PYTHON][CONNECT] Add optional pattern for Catalog.listDatabases ### What changes were proposed in this pull request? Currently, the syntax `SHOW [NAMESPACES | DATABASES | SCHEMAS] LIKE pattern` supports an optional pattern, so that the expected databases can be filtered out. However, `Catalog.listDatabases` is missing this capability in both the Catalog API and the Connect Catalog API. In fact, the optional pattern is very useful. ### Why are the changes needed? This PR adds the optional pattern for `Catalog.listDatabases`. ### Does this PR introduce _any_ user-facing change? 'No'. New feature. ### How was this patch tested? New test cases. Closes #41421 from beliefer/SPARK-43881.
Authored-by: Jiaan Geng Signed-off-by: Hyukjin Kwon --- .../org/apache/spark/sql/catalog/Catalog.scala | 8 ++ .../apache/spark/sql/internal/CatalogImpl.scala| 12 +++ .../scala/org/apache/spark/sql/CatalogSuite.scala | 5 + .../src/main/protobuf/spark/connect/catalog.proto | 5 +- .../sql/connect/planner/SparkConnectPlanner.scala | 6 +- project/MimaExcludes.scala | 4 +- python/pyspark/sql/catalog.py | 21 - python/pyspark/sql/connect/catalog.py | 4 +- python/pyspark/sql/connect/plan.py | 8 +- python/pyspark/sql/connect/proto/catalog_pb2.py| 104 ++--- python/pyspark/sql/connect/proto/catalog_pb2.pyi | 14 +++ python/pyspark/sql/tests/test_catalog.py | 4 + .../org/apache/spark/sql/catalog/Catalog.scala | 8 ++ .../apache/spark/sql/internal/CatalogImpl.scala| 18 .../apache/spark/sql/internal/CatalogSuite.scala | 4 + 15 files changed, 164 insertions(+), 61 deletions(-) diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala index 62167b242de..363f895db20 100644 --- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala +++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala @@ -51,6 +51,14 @@ abstract class Catalog { */ def listDatabases(): Dataset[Database] + /** + * Returns a list of databases (namespaces) which name match the specify pattern and available + * within the current catalog. + * + * @since 3.5.0 + */ + def listDatabases(pattern: String): Dataset[Database] + /** * Returns a list of tables/views in the current database (namespace). This includes all * temporary views. 
diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala index f72a99f6675..c2ed7f4e19e 100644 --- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala +++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala @@ -66,6 +66,18 @@ class CatalogImpl(sparkSession: SparkSession) extends Catalog { } } + /** + * Returns a list of databases (namespaces) which name match the specify pattern and available + * within the current catalog. + * + * @since 3.5.0 + */ + override def listDatabases(pattern: String): Dataset[Database] = { +sparkSession.newDataset(CatalogImpl.databaseEncoder) { builder => + builder.getCatalogBuilder.getListDatabasesBuilder.setPattern(pattern) +} + } + /** * Returns a list of tables/views in the current database (namespace). This includes all * temporary views. diff --git a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/CatalogSuite.scala b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/CatalogSuite.scala index 49741842377..396f7214c04 100644 --- a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/CatalogSuite.scala +++ b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/CatalogSuite.scala @@ -39,6 +39,11 @@ class CatalogSuite extends RemoteSparkSession with SQLHelper { assert(dbs.length == 2) assert(dbs.map(_.name) sameElements Array(db, currentDb)) assert(dbs.map(_.catalog).distinct sameElements Array("spark_catalog")) +var databasesWithPattern =
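Stepping back from the diff: the PR description frames the new `listDatabases(pattern)` overload in terms of the `SHOW [NAMESPACES | DATABASES | SCHEMAS] LIKE pattern` syntax, where `*` matches any sequence of characters and `|` separates alternative sub-patterns, matched case-insensitively against the whole name. The sketch below is a hypothetical Python illustration of those semantics only — `filter_pattern` is not Spark's implementation:

```python
import re

def filter_pattern(names, pattern):
    """Hypothetical sketch of SHOW DATABASES LIKE-style filtering:
    '*' matches any sequence of characters, '|' separates alternative
    sub-patterns, and a name must match a sub-pattern in full
    (case-insensitively)."""
    matched = []
    for sub in pattern.split("|"):
        regex = re.compile(sub.strip().replace("*", ".*"), re.IGNORECASE)
        for name in names:
            if regex.fullmatch(name) and name not in matched:
                matched.append(name)
    return matched

databases = ["default", "my_db1", "my_db2", "sales"]
print(filter_pattern(databases, "my_db*"))         # ['my_db1', 'my_db2']
print(filter_pattern(databases, "default|sales"))  # ['default', 'sales']
```

With the overload in place, a call such as `spark.catalog.listDatabases(pattern)` would return only the namespaces whose names pass this kind of filter.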
[spark] branch master updated (595ad30e625 -> 8ae95724721)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 595ad30e625 [SPARK-43911][SQL] Use toSet to deduplicate the iterator data to prevent the creation of large Array add 8ae95724721 [SPARK-43955][BUILD] Upgrade `scalafmt` from 3.7.3 to 3.7.4 No new revisions were added by this update. Summary of changes: dev/.scalafmt.conf | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
[spark-connect-go] branch master updated: [SPARK-43958] Adding support for Channel Builder
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark-connect-go.git The following commit(s) were added to refs/heads/master by this push: new af02e0e [SPARK-43958] Adding support for Channel Builder af02e0e is described below commit af02e0eda6dd0f6d2d3e39c9822db7b5032eaa82 Author: Martin Grund AuthorDate: Sun Jun 4 20:49:09 2023 +0900 [SPARK-43958] Adding support for Channel Builder ### What changes were proposed in this pull request? Add support for parsing the connection string of Spark Connect in the same way as is done for the other Spark Connect clients. ### Why are the changes needed? Compatibility ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #8 from grundprinzip/SPARK-43958. Authored-by: Martin Grund Signed-off-by: Hyukjin Kwon --- client/channel/channel.go | 141 client/channel/channel_test.go | 82 ++ client/sql/sparksession.go | 36 -- cmd/spark-connect-example-spark-session/main.go | 4 +- go.mod | 17 ++- go.sum | 34 -- 6 files changed, 290 insertions(+), 24 deletions(-) diff --git a/client/channel/channel.go b/client/channel/channel.go new file mode 100644 index 000..6cf7696 --- /dev/null +++ b/client/channel/channel.go @@ -0,0 +1,141 @@ +// +// Licensed to the Apache Software Foundation (ASF) under one or more +// contributor license agreements. See the NOTICE file distributed with +// this work for additional information regarding copyright ownership. +// The ASF licenses this file to You under the Apache License, Version 2.0 +// (the "License"); you may not use this file except in compliance with +// the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package channel + +import ( + "crypto/tls" + "crypto/x509" + "errors" + "fmt" + "net" + "net/url" + "strconv" + "strings" + + "golang.org/x/oauth2" + "google.golang.org/grpc" + "google.golang.org/grpc/credentials" + "google.golang.org/grpc/credentials/oauth" +) + +// Reserved header parameters that must not be injected as variables. +var reservedParams = []string{"user_id", "token", "use_ssl"} + +// The ChannelBuilder is used to parse the different parameters of the connection +// string according to the specification documented here: +// +// https://github.com/apache/spark/blob/master/connector/connect/docs/client-connection-string.md +type ChannelBuilder struct { + Host string + Port int + Token string + User string + Headers map[string]string +} + +// Finalizes the creation of the grpc.ClientConn by creating a GRPC channel +// with the necessary options extracted from the connection string. For +// TLS connections, this function will load the system certificates. +func (cb *ChannelBuilder) Build() (*grpc.ClientConn, error) { + var opts []grpc.DialOption + + opts = append(opts, grpc.WithAuthority(cb.Host)) + if cb.Token == "" { + opts = append(opts, grpc.WithInsecure()) + } else { + // Note: On the Windows platform, use of x509.SystemCertPool() requires + // go version 1.18 or higher. 
+ systemRoots, err := x509.SystemCertPool() + if err != nil { + return nil, err + } + cred := credentials.NewTLS(&tls.Config{ + RootCAs: systemRoots, + }) + opts = append(opts, grpc.WithTransportCredentials(cred)) + + t := oauth2.Token{ + AccessToken: cb.Token, + TokenType: "bearer", + } + opts = append(opts, grpc.WithPerRPCCredentials(oauth.NewOauthAccess(&t))) + } + + remote := fmt.Sprintf("%v:%v", cb.Host, cb.Port) + conn, err := grpc.Dial(remote, opts...) + if err != nil { + return nil, fmt.Errorf("failed to connect to remote %s: %w", remote, err) + } + return conn, nil +} + +// Creates a new instance of the ChannelBuilder. This constructor effectively +// parses the connection string and extracts the relevant parameters directly. +func NewBuilder(connection string) (*ChannelBuilder,
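Per the client-connection-string specification linked in the Go code above, the connection string takes the form `sc://host:port/;param1=value1;param2=value2`. A loose Python sketch of that parsing follows — it is an illustration of the documented format, not the Go client's actual logic, and the default port 15002 is an assumption taken from the Spark Connect documentation:

```python
from urllib.parse import urlparse

def parse_connection_string(conn):
    """Sketch of Spark Connect connection-string parsing:
    sc://host:port/;param1=value1;param2=value2"""
    if not conn.startswith("sc://"):
        raise ValueError("URL must start with sc://")
    url = urlparse(conn)
    params = {}
    # Everything after the first ';' in the path is a key=value parameter.
    for part in url.path.split(";")[1:]:
        if part:
            key, _, value = part.partition("=")
            params[key] = value
    return {
        "host": url.hostname,
        "port": url.port or 15002,  # assumed Spark Connect default port
        "params": params,
    }

print(parse_connection_string("sc://myhost:443/;token=ABC;user_id=me"))
# {'host': 'myhost', 'port': 443, 'params': {'token': 'ABC', 'user_id': 'me'}}
```

Parameters such as `token`, `user_id`, and `use_ssl` are then consumed by the builder itself (they are the `reservedParams` in the Go code), while the rest become headers.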
[GitHub] [spark-website] HyukjinKwon commented on pull request #463: Remove koalas from third-party-projects
HyukjinKwon commented on PR #463: URL: https://github.com/apache/spark-website/pull/463#issuecomment-1575496366 Lgtm
[spark] branch branch-3.4 updated: [SPARK-43911][SQL] Use toSet to deduplicate the iterator data to prevent the creation of large Array
This is an automated email from the ASF dual-hosted git repository. yumwang pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 93709918aff [SPARK-43911][SQL] Use toSet to deduplicate the iterator data to prevent the creation of large Array 93709918aff is described below commit 93709918affba4846a30cbae8692a6a328b5a448 Author: mcdull-zhang AuthorDate: Sun Jun 4 15:04:33 2023 +0800 [SPARK-43911][SQL] Use toSet to deduplicate the iterator data to prevent the creation of large Array ### What changes were proposed in this pull request? When SubqueryBroadcastExec reuses the keys of Broadcast HashedRelation for dynamic partition pruning, it will put all the keys in an Array, and then call the distinct of the Array to remove the duplicates. In general, Broadcast HashedRelation may have many rows, and the repetition rate of this key is high. Doing so will cause this Array to occupy a large amount of memory (and this memory is not managed by MemoryManager), which may trigger OOM. The approach here is to directly call the toSet of the iterator to deduplicate, which can prevent the creation of a large array. ### Why are the changes needed? 
Avoid the occurrence of the following OOM exceptions: ```text Exception in thread "dynamicpruning-0" java.lang.OutOfMemoryError: Java heap space at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106) at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96) at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49) at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:85) at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:49) at scala.collection.generic.Growable.$anonfun$$plus$plus$eq$1(Growable.scala:62) at scala.collection.generic.Growable$$Lambda$7/1514840818.apply(Unknown Source) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49) at scala.collection.TraversableOnce.to(TraversableOnce.scala:366) at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364) at scala.collection.AbstractIterator.to(Iterator.scala:1431) at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358) at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431) at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345) at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339) at scala.collection.AbstractIterator.toArray(Iterator.scala:1431) at org.apache.spark.sql.execution.SubqueryBroadcastExec.$anonfun$relationFuture$2(SubqueryBroadcastExec.scala:92) at org.apache.spark.sql.execution.SubqueryBroadcastExec$$Lambda$4212/5099232.apply(Unknown Source) at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withExecutionId$1(SQLExecution.scala:140) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Production environment manual verification && Pass existing unit tests Closes #41419 from mcdull-zhang/reduce_memory_usage. Authored-by: mcdull-zhang Signed-off-by: Yuming Wang (cherry picked from commit 595ad30e6259f7e4e4252dfee7704b73fd4760f7) Signed-off-by: Yuming Wang --- .../scala/org/apache/spark/sql/execution/SubqueryBroadcastExec.scala| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/SubqueryBroadcastExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/SubqueryBroadcastExec.scala index 22d042ccefb..80f863515d4 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/SubqueryBroadcastExec.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/SubqueryBroadcastExec.scala @@ -93,7 +93,7 @@ case class SubqueryBroadcastExec( val rows = if (broadcastRelation.keyIsUnique) { keyIter.toArray[InternalRow] } else { - keyIter.toArray[InternalRow].distinct + keyIter.toSet[InternalRow].toArray }
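The memory argument behind the one-line Scala change above generalizes: materializing every row before calling `distinct` peaks at memory proportional to the total row count, while folding the iterator into a set peaks at memory proportional to the distinct-key count. A Python analogue of the before/after (illustrative only — the actual fix is the Scala diff in `SubqueryBroadcastExec`):

```python
def dedup_old(key_iter):
    # Before: materialize the whole iterator, then deduplicate.
    # Peak memory is proportional to the TOTAL number of rows.
    all_rows = list(key_iter)
    return list(dict.fromkeys(all_rows))

def dedup_new(key_iter):
    # After: deduplicate while consuming the iterator.
    # Peak memory is proportional to the number of DISTINCT rows.
    seen = set()
    for row in key_iter:
        seen.add(row)
    return list(seen)

# A highly repetitive key stream, as described in the commit message:
# a million rows but only three distinct keys.
keys = (i % 3 for i in range(1_000_000))
print(sorted(dedup_new(keys)))  # [0, 1, 2]
```

With a high repetition rate, the old approach holds every duplicate in memory at once (the `toBuffer`/`toArray` frames in the stack trace above), which is exactly what triggered the OOM.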
[spark] branch master updated (af61526bea7 -> 595ad30e625)
This is an automated email from the ASF dual-hosted git repository. yumwang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from af61526bea7 [SPARK-43917][PS][INFRA] Upgrade `pandas` to 2.0.2 add 595ad30e625 [SPARK-43911][SQL] Use toSet to deduplicate the iterator data to prevent the creation of large Array No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/sql/execution/SubqueryBroadcastExec.scala| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch master updated: [SPARK-43917][PS][INFRA] Upgrade `pandas` to 2.0.2
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new af61526bea7 [SPARK-43917][PS][INFRA] Upgrade `pandas` to 2.0.2 af61526bea7 is described below commit af61526bea7d2a9e02c3d4acd691fc03695c4573 Author: Bjørn Jørgensen AuthorDate: Sat Jun 3 23:21:58 2023 -0700 [SPARK-43917][PS][INFRA] Upgrade `pandas` to 2.0.2 ### What changes were proposed in this pull request? Upgrade pandas from 2.0.0 to 2.0.2 ### Why are the changes needed? This fixes some regressions and bugs. [Whats new in 2.0.2](https://pandas.pydata.org/docs/whatsnew/v2.0.2.html) [Whats new in 2.0.1](https://pandas.pydata.org/docs/whatsnew/v2.0.1.html) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA Closes #41437 from bjornjorgensen/pandas2.0.2. Lead-authored-by: Bjørn Jørgensen Co-authored-by: bjornjorgensen Signed-off-by: Dongjoon Hyun --- dev/infra/Dockerfile | 4 ++-- python/pyspark/pandas/supported_api_gen.py | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/dev/infra/Dockerfile b/dev/infra/Dockerfile index 888b4e00b39..3b95467389a 100644 --- a/dev/infra/Dockerfile +++ b/dev/infra/Dockerfile @@ -64,8 +64,8 @@ RUN Rscript -e "devtools::install_version('roxygen2', version='7.2.0', repos='ht # See more in SPARK-39735 ENV R_LIBS_SITE "/usr/local/lib/R/site-library:${R_LIBS_SITE}:/usr/lib/R/library" -RUN pypy3 -m pip install numpy 'pandas<=2.0.0' scipy coverage matplotlib -RUN python3.9 -m pip install numpy pyarrow 'pandas<=2.0.0' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*' +RUN pypy3 -m pip install numpy 'pandas<=2.0.2' scipy coverage matplotlib +RUN python3.9 -m pip install numpy pyarrow 'pandas<=2.0.2' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' 
coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*' # Add Python deps for Spark Connect. RUN python3.9 -m pip install grpcio protobuf googleapis-common-protos grpcio-status diff --git a/python/pyspark/pandas/supported_api_gen.py b/python/pyspark/pandas/supported_api_gen.py index b5d6cadd3ca..d259171ecb9 100644 --- a/python/pyspark/pandas/supported_api_gen.py +++ b/python/pyspark/pandas/supported_api_gen.py @@ -98,7 +98,7 @@ def generate_supported_api(output_rst_file_path: str) -> None: Write supported APIs documentation. """ -pandas_latest_version = "2.0.0" +pandas_latest_version = "2.0.2" if LooseVersion(pd.__version__) != LooseVersion(pandas_latest_version): msg = ( "Warning: Latest version of pandas (%s) is required to generate the documentation; "
[spark] branch master updated: [SPARK-43953][CONNECT] Remove `pass`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7b25b241368 [SPARK-43953][CONNECT] Remove `pass` 7b25b241368 is described below commit 7b25b241368e3c6cf621785206a479d9f8524234 Author: Bjørn Jørgensen AuthorDate: Sat Jun 3 23:18:27 2023 -0700 [SPARK-43953][CONNECT] Remove `pass` ### What changes were proposed in this pull request? Remove a `pass` ### Why are the changes needed? [Delete this unreachable code or refactor the code to make it reachable.](https://sonarcloud.io/project/issues?languages=py=false=python%3AS1763=spark-python=AYTDYn77d80yCLHA8rt7) and [Remove this unneeded "pass".](https://sonarcloud.io/project/issues?languages=py=false=python%3AS2772=spark-python=AYTDYn77d80yCLHA8rt8) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA Closes #41438 from bjornjorgensen/remove-pass. Authored-by: Bjørn Jørgensen Signed-off-by: Dongjoon Hyun --- python/pyspark/sql/connect/plan.py | 1 - 1 file changed, 1 deletion(-) diff --git a/python/pyspark/sql/connect/plan.py b/python/pyspark/sql/connect/plan.py index 8218faecd9f..2793ecb3272 100644 --- a/python/pyspark/sql/connect/plan.py +++ b/python/pyspark/sql/connect/plan.py @@ -1533,7 +1533,6 @@ class WriteOperation(LogicalPlan): f"options: '{self.options}'" f"" ) -pass class WriteOperationV2(LogicalPlan):