[spark] branch master updated: [SPARK-39686][INFRA][FOLLOW-UP] Disable SparkR build in branch-3.2 with Scala 2.13
gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 0b1b5ffc971 [SPARK-39686][INFRA][FOLLOW-UP] Disable SparkR build in branch-3.2 with Scala 2.13

0b1b5ffc971 is described below

commit 0b1b5ffc97101f0b029db037a2278de78068b412
Author: Hyukjin Kwon
AuthorDate: Fri Jul 8 14:24:00 2022 +0900

    [SPARK-39686][INFRA][FOLLOW-UP] Disable SparkR build in branch-3.2 with Scala 2.13

    ### What changes were proposed in this pull request?
    This PR is a follow-up of https://github.com/apache/spark/pull/37091 that disables the SparkR build, which has never passed on branch-3.2 with Scala 2.13. See also SPARK-39712 (https://github.com/apache/spark/runs/7228058532?check_suite_focus=true).

    ### Why are the changes needed?
    To get the very first green run of this build.

    ### Does this PR introduce _any_ user-facing change?
    No, dev-only.

    ### How was this patch tested?
    The CI in the scheduled jobs should test it out.

    Closes #37124 from HyukjinKwon/SPARK-39686-followup.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
---
 .github/workflows/build_branch32.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/build_branch32.yml b/.github/workflows/build_branch32.yml
index d7f69484495..439f7a3c670 100644
--- a/.github/workflows/build_branch32.yml
+++ b/.github/workflows/build_branch32.yml
@@ -36,12 +36,12 @@ jobs:
         {
           "SCALA_PROFILE": "scala2.13"
         }
+      # TODO(SPARK-39712): Reenable "sparkr": "true"
       # TODO(SPARK-39685): Reenable "lint": "true"
       # TODO(SPARK-39681): Reenable "pyspark": "true"
       # TODO(SPARK-39682): Reenable "docker-integration-tests": "true"
       jobs: >-
         {
           "build": "true",
-          "sparkr": "true",
           "tpcds-1g": "true"
         }
[spark] branch master updated: [SPARK-38899][SQL] DS V2 supports push down datetime functions
wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 1df405fb122 [SPARK-38899][SQL] DS V2 supports push down datetime functions

1df405fb122 is described below

commit 1df405fb122fa492e2f499b9bb1cf3ba5ecfd060
Author: chenzhx
AuthorDate: Fri Jul 8 11:34:23 2022 +0800

    [SPARK-38899][SQL] DS V2 supports push down datetime functions

    ### What changes were proposed in this pull request?
    Currently, Spark has a number of datetime functions; see
    https://github.com/apache/spark/blob/2f8613f22c0750c00cf1dcfb2f31c431d8dc1be7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L577

    These functions are:
    `DATE_ADD`, `DATEDIFF`, `TRUNC`, `EXTRACT`, `SECOND`, `MINUTE`, `HOUR`, `MONTH`, `QUARTER`, `YEAR`, `DAYOFWEEK`, `DAYOFMONTH`, `DAYOFYEAR`

    The table below shows which mainstream databases support these functions:

    Function | PostgreSQL | ClickHouse | H2 | MySQL | Oracle | Presto | Teradata | Snowflake | DB2 | Vertica | Exasol | Impala | Mariadb | Druid | Singlestore | ElasticSearch
    -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
    `DateAdd` | No | Yes | Yes | Yes | Yes | Yes | No | Yes | No | No | No | Yes | Yes | No | Yes | Yes
    `DateDiff` | No | Yes | Yes | Yes | Yes | Yes | No | Yes | No | Yes | No | Yes | Yes | No | Yes | Yes
    `DateTrunc` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes
    `Hour` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
    `Minute` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
    `Month` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
    `Quarter` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
    `Second` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
    `Year` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
    `DayOfMonth` | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes
    `DayOfWeek` | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes
    `DayOfYear` | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes
    `WEEK_OF_YEAR` | Yes | No | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes
    `YEAR_OF_WEEK` | No | No | Yes | Yes | Yes | Yes | No | Yes | No | No | No | No | Yes | No | No | No

    DS V2 should support pushing down these datetime functions.

    ### Why are the changes needed?
    So that DS V2 can push down datetime functions.

    ### Does this PR introduce _any_ user-facing change?
    No. New feature.

    ### How was this patch tested?
    New tests.

    Closes #36663 from chenzhx/datetime.

    Authored-by: chenzhx
    Signed-off-by: Wenchen Fan
---
 .../spark/sql/connector/expressions/Extract.java   |  62 +++++++++
 .../expressions/GeneralScalarExpression.java       |  18 +++
 .../sql/connector/util/V2ExpressionSQLBuilder.java |  11 ++
 .../sql/catalyst/util/V2ExpressionBuilder.scala    |  57 +++++++-
 .../org/apache/spark/sql/jdbc/H2Dialect.scala      |  26 ++++
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala    | 146 +++++++++++++++++++++
 6 files changed, 296 insertions(+), 24 deletions(-)

diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Extract.java b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Extract.java
new file mode 100644
index 000..a925f1ee31a
--- /dev/null
+++ b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Extract.java
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.expressions;
+
+import org.apache.spark.annotation.Evolving;
+
+import java.io.Serializable;
+
+/**
+ * Represent an extract function, which extracts and returns the value of a
+ * specified datetime field from a datetime or interval value expression.
+ * <p>
+ * The currently supported fields names following the ISO standard:
+ *  SECOND Since 3.4.0
+ *  MINUTE Since 3.4.0
+ *  HOUR Since 3.4.0
+ *  MONTH Since 3.4.0
+ *  QUARTER Since
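To make the intent of the commit concrete, here is a minimal, hypothetical sketch of a query whose datetime expressions could now be compiled into the remote database's SQL by a JDBC V2 dialect such as the H2 dialect touched in this diff. The `h2` catalog name and the `test.employee` table with a `hire_date` column are illustrative assumptions, not taken from the commit:

```scala
import org.apache.spark.sql.SparkSession

object DatetimePushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Assumes a JDBC V2 catalog named "h2" has been configured via
    // spark.sql.catalog.h2 and that test.employee has a DATE column hire_date.
    val df = spark.sql(
      """SELECT name FROM h2.test.employee
        |WHERE extract(YEAR FROM hire_date) = 2022
        |  AND dayofweek(hire_date) = 2""".stripMargin)

    // With this commit, expressions such as EXTRACT / HOUR / DAY_OF_WEEK can be
    // translated through V2ExpressionSQLBuilder and pushed into the JDBC scan;
    // the pushed predicates appear in the explain output of the scan node.
    df.explain(true)

    spark.stop()
  }
}
```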
[spark] branch master updated: [SPARK-39693][INFRA] Do Not Execute tpcds-1g-gen for Benchmarks Other Than TPCDSQueryBenchmark
gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 231d3760fe5 [SPARK-39693][INFRA] Do Not Execute tpcds-1g-gen for Benchmarks Other Than TPCDSQueryBenchmark

231d3760fe5 is described below

commit 231d3760fe587973e3c1699912015907d6b26766
Author: Kazuyuki Tanimura
AuthorDate: Fri Jul 8 09:26:35 2022 +0900

    [SPARK-39693][INFRA] Do Not Execute tpcds-1g-gen for Benchmarks Other Than TPCDSQueryBenchmark

    ### What changes were proposed in this pull request?
    Currently `tpcds-1g-gen` runs on GitHub Actions for every benchmark, even for benchmarks that do not require TPC-DS data. This PR proposes to skip running `tpcds-1g-gen` if the benchmark class contains neither `TPCDSQueryBenchmark` nor `*`, based on the discussion in #37020.

    ### Why are the changes needed?
    This PR saves time when launching benchmarks on GitHub Actions.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Tested on GitHub Actions.

    Closes #37120 from kazuyukitanimura/SPARK-39693.

    Authored-by: Kazuyuki Tanimura
    Signed-off-by: Hyukjin Kwon
---
 .github/workflows/benchmark.yml | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml
index 3170c7c6bb0..4a5fd661c78 100644
--- a/.github/workflows/benchmark.yml
+++ b/.github/workflows/benchmark.yml
@@ -59,6 +59,7 @@ jobs:
   # Any TPC-DS related updates on this job need to be applied to tpcds-1g job of build_and_test.yml as well
   tpcds-1g-gen:
     name: "Generate an input dataset for TPCDSQueryBenchmark with SF=1"
+    if: contains(github.event.inputs.class, 'TPCDSQueryBenchmark') || contains(github.event.inputs.class, '*')
     runs-on: ubuntu-20.04
     env:
       SPARK_LOCAL_IP: localhost
@@ -113,6 +114,7 @@ jobs:
   benchmark:
     name: "Run benchmarks: ${{ github.event.inputs.class }} (JDK ${{ github.event.inputs.jdk }}, Scala ${{ github.event.inputs.scala }}, ${{ matrix.split }} out of ${{ github.event.inputs.num-splits }} splits)"
+    if: always()
     needs: [matrix-gen, tpcds-1g-gen]
     # Ubuntu 20.04 is the latest LTS. The next LTS is 22.04.
     runs-on: ubuntu-20.04
@@ -158,6 +160,7 @@ jobs:
       with:
         java-version: ${{ github.event.inputs.jdk }}
     - name: Cache TPC-DS generated data
+      if: contains(github.event.inputs.class, 'TPCDSQueryBenchmark') || contains(github.event.inputs.class, '*')
       id: cache-tpcds-sf-1
       uses: actions/cache@v2
       with:
[spark] branch dependabot/maven/org.eclipse.jetty-jetty-server-10.0.10 created (now ddc419dce6e)
github-bot pushed a change to branch dependabot/maven/org.eclipse.jetty-jetty-server-10.0.10 in repository https://gitbox.apache.org/repos/asf/spark.git

      at ddc419dce6e Bump jetty-server from 9.4.46.v20220331 to 10.0.10

No new revisions were added by this update.
[spark] branch dependabot/maven/org.eclipse.jetty-jetty-http-9.4.48.v20220622 created (now 86069eb5d7f)
github-bot pushed a change to branch dependabot/maven/org.eclipse.jetty-jetty-http-9.4.48.v20220622 in repository https://gitbox.apache.org/repos/asf/spark.git

      at 86069eb5d7f Bump jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622

No new revisions were added by this update.
[spark] branch master updated (fe7b8fcd6fe -> 7dcb4bafd02)
wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

    from fe7b8fcd6fe [SPARK-33753][CORE] Reduce the memory footprint and gc of the cache (hadoopJobMetadata)
     add 7dcb4bafd02 [SPARK-39385][SQL] Translate linear regression aggregate functions for pushdown

No new revisions were added by this update.

Summary of changes:
 .../aggregate/GeneralAggregateFunc.java            |  4 ++
 .../execution/datasources/DataSourceStrategy.scala | 48 +++++++--
 .../org/apache/spark/sql/jdbc/H2Dialect.scala      | 12 ++++
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala    | 80 +++++++++---
 4 files changed, 109 insertions(+), 35 deletions(-)
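This notification carries only a file summary. As a hedged illustration of what "translate linear regression aggregate functions for pushdown" means, a query over a JDBC V2 catalog using such an aggregate could now be evaluated by the remote database rather than by Spark. The `h2` catalog, the `test.employee` table, and the choice of `regr_slope` are assumptions for illustration; the exact set of translated functions is in the `H2Dialect` diff, which is not reproduced here:

```scala
import org.apache.spark.sql.SparkSession

object RegrPushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Hypothetical JDBC V2 catalog "h2"; regr_slope stands in for whichever
    // linear regression aggregates the commit actually translates.
    val df = spark.sql(
      "SELECT regr_slope(salary, bonus) FROM h2.test.employee GROUP BY dept")

    // If the dialect reports the aggregate as supported, the whole GROUP BY can
    // be pushed to the source and the scan returns pre-aggregated rows.
    df.explain(true)

    spark.stop()
  }
}
```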
[spark] branch branch-3.2 updated (32aff86477a -> c5983c1691f)
wenchen pushed a change to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git

    from 32aff86477a [SPARK-39447][SQL][3.2] Avoid AssertionError in AdaptiveSparkPlanExec.doExecuteBroadcast
     add c5983c1691f [SPARK-38018][SQL][3.2] Fix ColumnVectorUtils.populate to handle CalendarIntervalType correctly

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/execution/vectorized/ColumnVectorUtils.java  |  3 ++-
 .../spark/sql/execution/vectorized/ColumnVectorSuite.scala | 11 ++-
 2 files changed, 12 insertions(+), 2 deletions(-)
[spark] branch branch-3.2 updated: [SPARK-39447][SQL][3.2] Avoid AssertionError in AdaptiveSparkPlanExec.doExecuteBroadcast
wenchen pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 32aff86477a [SPARK-39447][SQL][3.2] Avoid AssertionError in AdaptiveSparkPlanExec.doExecuteBroadcast

32aff86477a is described below

commit 32aff86477ac001b0ee047db08591d89e90c6eb8
Author: ulysses-you
AuthorDate: Thu Jul 7 22:49:03 2022 +0800

    [SPARK-39447][SQL][3.2] Avoid AssertionError in AdaptiveSparkPlanExec.doExecuteBroadcast

    This is a backport of https://github.com/apache/spark/pull/36974 for branch-3.2.

    ### What changes were proposed in this pull request?
    Change `currentPhysicalPlan` to `inputPlan` when we restore the broadcast exchange for DPP.

    ### Why are the changes needed?
    `currentPhysicalPlan` can be wrapped in a broadcast query stage, so it is not safe to match on it. For example, the broadcast exchange added by DPP can run earlier than the normal broadcast exchange (e.g., the one introduced by a join).

    ### Does this PR introduce _any_ user-facing change?
    Yes, it fixes a bug.

    ### How was this patch tested?
    Added a test.

    Closes #37087 from ulysses-you/inputplan-3.2.

    Authored-by: ulysses-you
    Signed-off-by: Wenchen Fan
---
 .../execution/adaptive/AdaptiveSparkPlanExec.scala |  2 +-
 .../spark/sql/DynamicPartitionPruningSuite.scala   | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
index e6c8be1397e..7aeb1c34329 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
@@ -658,7 +658,7 @@ case class AdaptiveSparkPlanExec(
     // node to prevent the loss of the `BroadcastExchangeExec` node in DPP subquery.
     // Here, we also need to avoid to insert the `BroadcastExchangeExec` node when the newPlan is
     // already the `BroadcastExchangeExec` plan after apply the `LogicalQueryStageStrategy` rule.
-    val finalPlan = currentPhysicalPlan match {
+    val finalPlan = inputPlan match {
       case b: BroadcastExchangeLike
         if (!newPlan.isInstanceOf[BroadcastExchangeLike]) =>
         b.withNewChildren(Seq(newPlan))
       case _ => newPlan
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala
index 89749e7de00..91176717774 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala
@@ -1597,6 +1597,25 @@ class DynamicPartitionPruningV1SuiteAEOff extends DynamicPartitionPruningV1Suite
 class DynamicPartitionPruningV1SuiteAEOn extends DynamicPartitionPruningV1Suite
   with EnableAdaptiveExecutionSuite {
 
+  test("SPARK-39447: Avoid AssertionError in AdaptiveSparkPlanExec.doExecuteBroadcast") {
+    val df = sql(
+      """
+        |WITH empty_result AS (
+        |  SELECT * FROM fact_stats WHERE product_id < 0
+        |)
+        |SELECT *
+        |FROM (SELECT /*+ SHUFFLE_MERGE(fact_sk) */ empty_result.store_id
+        |      FROM fact_sk
+        |        JOIN empty_result
+        |        ON fact_sk.product_id = empty_result.product_id) t2
+        |  JOIN empty_result
+        |  ON t2.store_id = empty_result.store_id
+      """.stripMargin)
+
+    checkPartitionPruningPredicate(df, false, false)
+    checkAnswer(df, Nil)
+  }
+
   test("SPARK-37995: PlanAdaptiveDynamicPruningFilters should use prepareExecutedPlan " +
     "rather than createSparkPlan to re-plan subquery") {
     withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true",
[spark] branch master updated (5adcddb87a0 -> fe7b8fcd6fe)
wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

    from 5adcddb87a0 [SPARK-39695][SQL] Add the `REGEXP_SUBSTR` function
     add fe7b8fcd6fe [SPARK-33753][CORE] Reduce the memory footprint and gc of the cache (hadoopJobMetadata)

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/SparkEnv.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch master updated: [SPARK-39695][SQL] Add the `REGEXP_SUBSTR` function
maxgekk pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 5adcddb87a0 [SPARK-39695][SQL] Add the `REGEXP_SUBSTR` function

5adcddb87a0 is described below

commit 5adcddb87a052ce8e3b3c917c61f019bea5532ae
Author: Max Gekk
AuthorDate: Thu Jul 7 11:22:41 2022 +0300

    [SPARK-39695][SQL] Add the `REGEXP_SUBSTR` function

    ### What changes were proposed in this pull request?
    In the PR, I propose to add a new expression, `RegExpSubStr`, as a runtime-replaceable expression built on `NullIf` and `RegExpExtract`, and bind it to the function name `REGEXP_SUBSTR`.

    The `REGEXP_SUBSTR` function returns the substring that matches a regular expression within a string. It takes two parameters:
    1. An expression that specifies the string in which the search is to take place.
    2. An expression that specifies the regular expression string that is the pattern for the search.

    If the regular expression is not found, the result is **null** (this behaviour is similar to other DBMSs). When any of the input parameters is NULL, the function returns NULL too. For example:

    ```sql
    spark-sql> CREATE TABLE log (logs string);
    spark-sql> INSERT INTO log (logs) VALUES
             > ('127.0.0.1 - - [10/Jan/2022:16:55:36 -0800] "GET / HTTP/1.0" 200 2217'),
             > ('192.168.1.99 - - [14/Feb/2022:10:27:10 -0800] "GET /cgi-bin/try/ HTTP/1.0" 200 3396');
    spark-sql> SELECT REGEXP_SUBSTR (logs,'\\b\\d{1,3}\.\\d{1,3}\.\\d{1,3}\.\\d{1,3}\\b') AS IP,
             >        REGEXP_SUBSTR (logs,'([\\w:\/]+\\s[+\-]\\d{4})') AS DATE FROM log;
    127.0.0.1       10/Jan/2022:16:55:36 -0800
    192.168.1.99    14/Feb/2022:10:27:10 -0800
    ```

    ### Why are the changes needed?
    To ease migration from other systems to Spark SQL and achieve feature parity with them. For example, the systems below support the `REGEXP_SUBSTR` function:
    - Oracle: https://docs.oracle.com/cd/B12037_01/server.101/b10759/functions116.htm
    - DB2: https://www.ibm.com/docs/en/db2/11.5?topic=functions-regexp-substr
    - Snowflake: https://docs.snowflake.com/en/sql-reference/functions/regexp_substr.html
    - BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_substr
    - Redshift: https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_SUBSTR.html
    - MariaDB: https://mariadb.com/kb/en/regexp_substr/
    - Exasol DB: https://docs.exasol.com/db/latest/sql_references/functions/alphabeticallistfunctions/regexp_substr.htm

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    By running the new tests:
    ```
    $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z regexp-functions.sql"
    $ build/sbt "sql/testOnly *ExpressionsSchemaSuite"
    $ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite"
    ```

    Closes #37101 from MaxGekk/regexp_substr.

    Authored-by: Max Gekk
    Signed-off-by: Max Gekk
---
 .../sql/catalyst/analysis/FunctionRegistry.scala   |  1 +
 .../catalyst/expressions/regexpExpressions.scala   | 39 +++++++
 .../sql-functions/sql-expression-schema.md         |  1 +
 .../sql-tests/inputs/regexp-functions.sql          |  9 ++++
 .../sql-tests/results/regexp-functions.sql.out     | 56 ++++++++
 5 files changed, 106 insertions(+)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
index 52d84cfa175..20c719aec68 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
@@ -585,6 +585,7 @@ object FunctionRegistry {
     expression[XPathShort]("xpath_short"),
     expression[XPathString]("xpath_string"),
     expression[RegExpCount]("regexp_count"),
+    expression[RegExpSubStr]("regexp_substr"),
 
     // datetime functions
     expression[AddMonths]("add_months"),
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
index 8d813058296..b240e849f4d 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
@@ -1004,3 +1004,42 @@ case class RegExpCount(left: Expression, right: Expression)
     newChildren: IndexedSeq[Expression]): RegExpCount =
     copy(left = newChildren(0), right = newChildren(1))
 }
+
+//
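As a small sketch of the semantics documented above (a match returns the matching substring; no match, or a NULL input, yields NULL), assuming a Spark build that already contains this commit — the function targets a future release:

```scala
import org.apache.spark.sql.SparkSession

object RegexpSubstrSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // A match returns the matching substring.
    spark.sql("SELECT regexp_substr('Steven Jones and Stephen Smith', 'Ste(v|ph)en')").show()
    // -> Steven

    // No match (or any NULL input) yields NULL, mirroring the DBMSs listed above.
    spark.sql("SELECT regexp_substr('hello world', '[0-9]+')").show()
    // -> null

    spark.stop()
  }
}
```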
[spark] branch master updated: [SPARK-39689][SQL] Support 2-chars `lineSep` in CSV datasource
gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new bb4c4778713 [SPARK-39689][SQL] Support 2-chars `lineSep` in CSV datasource

bb4c4778713 is described below

commit bb4c4778713c7ba1ee92d0bb0763d7d3ce54374f
Author: yaohua
AuthorDate: Thu Jul 7 15:22:06 2022 +0900

    [SPARK-39689][SQL] Support 2-chars `lineSep` in CSV datasource

    ### What changes were proposed in this pull request?
    The Univocity parser allows the line separator to be 1 or 2 characters ([code](https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/Format.java#L103)); the CSV options should not block this usage ([code](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala#L218)).

    This PR relaxes the `lineSep` requirement in `CSVOptions` to accept 1 or 2 characters. Because multi-character `lineSep` values are not fully supported within quotes, this feature is left undocumented and a WARN message is logged for users.

    ### Why are the changes needed?
    Unblocks the use of a 2-character `lineSep`.

    ### Does this PR introduce _any_ user-facing change?
    No - undocumented feature.

    ### How was this patch tested?
    New UT.

    Closes #37107 from Yaohua628/spark-39689.

    Lead-authored-by: yaohua
    Co-authored-by: Yaohua Zhao <79476540+yaohua...@users.noreply.github.com>
    Signed-off-by: Hyukjin Kwon
---
 .../apache/spark/sql/catalyst/csv/CSVOptions.scala |  6 +++-
 .../sql/execution/datasources/csv/CSVSuite.scala   | 35 ++++++++++++++
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
index 9daa50ba5a4..3e92c3d25eb 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
@@ -215,7 +215,11 @@ class CSVOptions(
    */
   val lineSeparator: Option[String] = parameters.get("lineSep").map { sep =>
     require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
-    require(sep.length == 1, "'lineSep' can contain only 1 character.")
+    // Intentionally allow it up to 2 for Window's CRLF although multiple
+    // characters have an issue with quotes. This is intentionally undocumented.
+    require(sep.length <= 2, "'lineSep' can contain only 1 character.")
+    if (sep.length == 2) logWarning("It is not recommended to set 'lineSep' " +
+      "with 2 characters due to the limitation of supporting multi-char 'lineSep' within quotes.")
     sep
   }
 
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index 62dccaad1dd..bf92ffcf465 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
@@ -34,6 +34,7 @@ import org.apache.commons.lang3.exception.ExceptionUtils
 import org.apache.commons.lang3.time.FastDateFormat
 import org.apache.hadoop.io.SequenceFile.CompressionType
 import org.apache.hadoop.io.compress.GzipCodec
+import org.apache.logging.log4j.Level
 
 import org.apache.spark.{SparkConf, SparkException, TestUtils}
 import org.apache.spark.sql.{AnalysisException, Column, DataFrame, Encoders, QueryTest, Row}
@@ -2296,6 +2297,40 @@ abstract class CSVSuite
     assert(errMsg2.contains("'lineSep' can contain only 1 character"))
   }
 
+  Seq(true, false).foreach { multiLine =>
+    test(s"""lineSep with 2 chars when multiLine set to $multiLine""") {
+      Seq("\r\n", "||", "|").foreach { newLine =>
+        val logAppender = new LogAppender("lineSep WARN logger")
+        withTempDir { dir =>
+          val inputData = if (multiLine) {
+            s"""name,"i am the${newLine} column1"${newLine}jack,30${newLine}tom,18"""
+          } else {
+            s"name,age${newLine}jack,30${newLine}tom,18"
+          }
+          Files.write(new File(dir, "/data.csv").toPath, inputData.getBytes())
+          withLogAppender(logAppender) {
+            val df = spark.read
+              .options(
+                Map("header" -> "true", "multiLine" -> multiLine.toString, "lineSep" -> newLine))
+              .csv(dir.getCanonicalPath)
+            // Due to the limitation of Univocity parser:
+            // multiple chars of newlines cannot be properly handled when they exist within quotes.
+            // Leave 2-char lineSep as an undocumented features and logWarn user
+            if (newLine
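As a usage sketch of the newly allowed two-character separator (the temp file and its contents are illustrative, not from the commit), reading a CRLF-terminated file now works by passing the separator directly:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object CrlfLineSepSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Illustrative CRLF-delimited input; with this change "\r\n" (2 characters)
    // passes the lineSep validation instead of being rejected.
    val path = Files.createTempFile("lineSep", ".csv")
    Files.write(path, "name,age\r\njack,30\r\ntom,18\r\n".getBytes("UTF-8"))

    val df = spark.read
      .option("header", "true")
      .option("lineSep", "\r\n") // previously only a single character was allowed
      .csv(path.toString)

    df.show() // expect two rows: (jack, 30) and (tom, 18)

    spark.stop()
  }
}
```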
[spark] branch master updated: [SPARK-39703][CORE][BUILD] Mima complains with Scala 2.13 for the changes in DeployMessages
gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 845950b72b6 [SPARK-39703][CORE][BUILD] Mima complains with Scala 2.13 for the changes in DeployMessages

845950b72b6 is described below

commit 845950b72b63f94b03436a598d9d041e662a0b53
Author: Hyukjin Kwon
AuthorDate: Thu Jul 7 15:21:25 2022 +0900

    [SPARK-39703][CORE][BUILD] Mima complains with Scala 2.13 for the changes in DeployMessages

    ### What changes were proposed in this pull request?
    This PR is a follow-up of https://github.com/apache/spark/pull/36716. Mima with Scala 2.13 complains about the changes in `DeployMessages` for some reason:

    ```
    [error] spark-core: Failed binary compatibility check against org.apache.spark:spark-core_2.13:3.2.0! Found 6 potential problems (filtered 933)
    [error]  * the type hierarchy of object org.apache.spark.deploy.DeployMessages#LaunchExecutor is different in current version. Missing types {scala.runtime.AbstractFunction7}
    [error]    filter with: ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.deploy.DeployMessages$LaunchExecutor$")
    [error]  * method requestedTotal()Int in class org.apache.spark.deploy.DeployMessages#RequestExecutors does not have a correspondent in current version
    [error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.requestedTotal")
    [error]  * method copy(java.lang.String,Int)org.apache.spark.deploy.DeployMessages#RequestExecutors in class org.apache.spark.deploy.DeployMessages#RequestExecutors's type is different in current version, where it is (java.lang.String,scala.collection.immutable.Map)org.apache.spark.deploy.DeployMessages#RequestExecutors instead of (java.lang.String,Int)org.apache.spark.deploy.DeployMessages#RequestExecutors
    [error]    filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.copy")
    [error]  * synthetic method copy$default$2()Int in class org.apache.spark.deploy.DeployMessages#RequestExecutors has a different result type in current version, where it is scala.collection.immutable.Map rather than Int
    [error]    filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.copy$default$2")
    [error]  * method this(java.lang.String,Int)Unit in class org.apache.spark.deploy.DeployMessages#RequestExecutors's type is different in current version, where it is (java.lang.String,scala.collection.immutable.Map)Unit instead of (java.lang.String,Int)Unit
    [error]    filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.this")
    [error]  * method apply(java.lang.String,Int)org.apache.spark.deploy.DeployMessages#RequestExecutors in object org.apache.spark.deploy.DeployMessages#RequestExecutors in current version does not have a correspondent with same parameter signature among (java.lang.String,scala.collection.immutable.Map)org.apache.spark.deploy.DeployMessages#RequestExecutors, (java.lang.Object,java.lang.Object)java.lang.Object
    [error]    filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.apply")
    ```

    https://github.com/apache/spark/runs/7221231391?check_suite_focus=true

    This PR adds the suggested filters.

    ### Why are the changes needed?
    To make the scheduled build (Scala 2.13) pass in https://github.com/apache/spark/actions/workflows/build_scala213.yml

    ### Does this PR introduce _any_ user-facing change?
    No, dev-only. The alarms are false positives.

    ### How was this patch tested?
    CI should verify this.

    Closes #37109 from HyukjinKwon/SPARK-39703.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
---
 project/MimaExcludes.scala | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala
index fb71155657f..3f3d8575477 100644
--- a/project/MimaExcludes.scala
+++ b/project/MimaExcludes.scala
@@ -54,7 +54,15 @@ object MimaExcludes {
       ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.classification.Classifier.getNumClasses"),
       ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.classification.Classifier.getNumClasses$default$2"),
       ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.classification.OneVsRest.extractInstances"),
-      ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.classification.OneVsRestModel.extractInstances")
+
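For readers unfamiliar with the mechanism: each `filter with:` line suggested by the Mima report is pasted into `project/MimaExcludes.scala` essentially as-is. A minimal sketch of the shape of such an exclusion list, as build-definition code run by sbt with the Mima plugin on the classpath (the surrounding object and value names are illustrative; the filter strings are copied verbatim from the report quoted above):

```scala
import com.typesafe.tools.mima.core._

// Illustrative shape of the entries appended to project/MimaExcludes.scala.
object MimaExcludesSketch {
  lazy val deployMessagesExcludes = Seq(
    // LaunchExecutor no longer extends the synthetic AbstractFunction7 parent.
    ProblemFilters.exclude[MissingTypesProblem](
      "org.apache.spark.deploy.DeployMessages$LaunchExecutor$"),
    // requestedTotal: Int was replaced by a Map-typed field.
    ProblemFilters.exclude[DirectMissingMethodProblem](
      "org.apache.spark.deploy.DeployMessages#RequestExecutors.requestedTotal"),
    ProblemFilters.exclude[IncompatibleMethTypeProblem](
      "org.apache.spark.deploy.DeployMessages#RequestExecutors.copy")
  )
}
```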
[spark] branch master updated: [SPARK-39679][SQL] TakeOrderedAndProjectExec should respect child output ordering
wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 427fbee4c00 [SPARK-39679][SQL] TakeOrderedAndProjectExec should respect child output ordering

427fbee4c00 is described below

commit 427fbee4c009d8d49fdb80a2e2532723eff84150
Author: ulysses-you
AuthorDate: Thu Jul 7 14:20:29 2022 +0800

    [SPARK-39679][SQL] TakeOrderedAndProjectExec should respect child output ordering

    ### What changes were proposed in this pull request?
    Skip the local sort in `TakeOrderedAndProjectExec` if the child output ordering already satisfies the required ordering.

    ### Why are the changes needed?
    `TakeOrderedAndProjectExec` should respect the child's output ordering to avoid an unnecessary sort, e.g. when `TakeOrderedAndProjectExec` sits on top of a `SortMergeJoin`:

    ```sql
    SELECT * FROM t1 JOIN t2 ON t1.c1 = t2.c2 ORDER BY t1.c1 LIMIT 100;
    ```

    ### Does this PR introduce _any_ user-facing change?
    No, it only improves performance.

    ### How was this patch tested?
    Added a benchmark test:

    ```scala
    val row = 10 * 1000
    val df1 = spark.range(0, row, 1, 2).selectExpr("id % 3 as c1")
    val df2 = spark.range(0, row, 1, 2).selectExpr("id % 3 as c2")
    df1.join(df2, col("c1") === col("c2"))
      .orderBy(col("c1"))
      .limit(100)
      .noop()
    ```

    Before:
    ```
    TakeOrderedAndProject
    OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
    Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
    TakeOrderedAndProject with SMJ:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
    ---------------------------------------------------------------------------------------------------------------------------------
    TakeOrderedAndProject with SMJ for doExecute                 3356           3414          61         0.0      335569.5       1.0X
    TakeOrderedAndProject with SMJ for executeCollect            3331           3370          47         0.0      333118.0       1.0X

    OpenJDK 64-Bit Server VM 11.0.15+10-LTS on Linux 5.13.0-1031-azure
    Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
    TakeOrderedAndProject with SMJ:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
    ---------------------------------------------------------------------------------------------------------------------------------
    TakeOrderedAndProject with SMJ for doExecute                 3745           3766          24         0.0      374477.3       1.0X
    TakeOrderedAndProject with SMJ for executeCollect            3657           3680          38         0.0      365703.4       1.0X

    OpenJDK 64-Bit Server VM 17.0.3+7-LTS on Linux 5.13.0-1031-azure
    Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
    TakeOrderedAndProject with SMJ:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
    ---------------------------------------------------------------------------------------------------------------------------------
    TakeOrderedAndProject with SMJ for doExecute                 2499           2554          47         0.0      249945.5       1.0X
    TakeOrderedAndProject with SMJ for executeCollect            2510           2515           8         0.0      250956.9       1.0X
    ```

    After:
    ```
    TakeOrderedAndProject
    OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
    Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
    TakeOrderedAndProject with SMJ:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
    ---------------------------------------------------------------------------------------------------------------------------------
    TakeOrderedAndProject with SMJ for doExecute                  287            337          43         0.0       28734.9       1.0X
    TakeOrderedAndProject with SMJ for executeCollect             150            170          30         0.1       15037.8       1.9X

    OpenJDK 64-Bit Server VM 11.0.15+10-LTS on Linux 5.13.0-1031-azure
    Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
    TakeOrderedAndProject with SMJ:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
    ```