[spark] branch master updated: [SPARK-39686][INFRA][FOLLOW-UP] Disable SparkR build in branch-3.2 with Scala 2.13

2022-07-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0b1b5ffc971 [SPARK-39686][INFRA][FOLLOW-UP] Disable SparkR build in 
branch-3.2 with Scala 2.13
0b1b5ffc971 is described below

commit 0b1b5ffc97101f0b029db037a2278de78068b412
Author: Hyukjin Kwon 
AuthorDate: Fri Jul 8 14:24:00 2022 +0900

[SPARK-39686][INFRA][FOLLOW-UP] Disable SparkR build in branch-3.2 with 
Scala 2.13

### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/37091 that disables the SparkR build, which has never passed in history, at branch-3.2 with Scala 2.13.

See also SPARK-39712 
(https://github.com/apache/spark/runs/7228058532?check_suite_focus=true)

### Why are the changes needed?

To have the very first green in the build.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

CI in the scheduled jobs should test it out.

Closes #37124 from HyukjinKwon/SPARK-39686-followup.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 .github/workflows/build_branch32.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/build_branch32.yml 
b/.github/workflows/build_branch32.yml
index d7f69484495..439f7a3c670 100644
--- a/.github/workflows/build_branch32.yml
+++ b/.github/workflows/build_branch32.yml
@@ -36,12 +36,12 @@ jobs:
 {
   "SCALA_PROFILE": "scala2.13"
 }
+  # TODO(SPARK-39712): Reenable "sparkr": "true"
   # TODO(SPARK-39685): Reenable "lint": "true"
   # TODO(SPARK-39681): Reenable "pyspark": "true"
   # TODO(SPARK-39682): Reenable "docker-integration-tests": "true"
   jobs: >-
 {
   "build": "true",
-  "sparkr": "true",
   "tpcds-1g": "true"
 }





[spark] branch master updated: [SPARK-38899][SQL] DS V2 supports push down datetime functions

2022-07-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1df405fb122 [SPARK-38899][SQL] DS V2 supports push down datetime 
functions
1df405fb122 is described below

commit 1df405fb122fa492e2f499b9bb1cf3ba5ecfd060
Author: chenzhx 
AuthorDate: Fri Jul 8 11:34:23 2022 +0800

[SPARK-38899][SQL] DS V2 supports push down datetime functions

### What changes were proposed in this pull request?

Currently, Spark has some datetime functions. Please refer to

https://github.com/apache/spark/blob/2f8613f22c0750c00cf1dcfb2f31c431d8dc1be7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L577

These functions are shown below:
`DATE_ADD`,
`DATEDIFF`,
`TRUNC`,
`EXTRACT`,
`SECOND`,
`MINUTE`,
`HOUR`,
`MONTH`,
`QUARTER`,
`YEAR`,
`DAYOFWEEK`,
`DAYOFMONTH`,
`DAYOFYEAR`

Support for these functions in mainstream databases is shown below.

Function|PostgreSQL|ClickHouse|H2|MySQL|Oracle|Presto|Teradata|Snowflake|DB2|Vertica|Exasol|Impala|Mariadb|Druid|Singlestore|ElasticSearch
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
`DateAdd`|No|Yes|Yes|Yes|Yes|Yes|No|Yes|No|No|No|Yes|Yes|No|Yes|Yes
`DateDiff`|No|Yes|Yes|Yes|Yes|Yes|No|Yes|No|Yes|No|Yes|Yes|No|Yes|Yes
`DateTrunc`|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes| Yes|Yes|Yes|Yes|No|Yes|Yes|Yes
`Hour`|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes
`Minute`|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes
`Month`|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes
`Quarter`|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes
`Second`|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes
`Year`|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes
`DayOfMonth`|Yes|Yes|Yes|Yes|Yes|Yes|No|Yes|Yes|Yes|No|Yes|Yes|Yes|No|Yes
`DayOfWeek`|Yes|Yes|Yes|Yes|Yes|Yes|No|Yes|Yes|Yes|No|Yes|Yes|Yes|Yes|Yes
`DayOfYear`|Yes|Yes|Yes|Yes|Yes|Yes|No|Yes|Yes|Yes|No|Yes|Yes|Yes|Yes|Yes
`WEEK_OF_YEAR`|Yes|No|Yes|Yes|Yes|Yes|No|Yes|Yes|Yes|No|Yes|Yes|Yes|Yes|Yes
`YEAR_OF_WEEK`|No|No|Yes|Yes|Yes|Yes|No|Yes|No|No|No|No|Yes|No|No|No

DS V2 should support pushing down these datetime functions.
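
As an illustration of what this pushdown enables, here is a minimal Scala sketch against an H2 table exposed through the DS V2 JDBC catalog. The catalog name `h2`, the in-memory URL, and the `test.datetime` table with `date1`/`time1` columns are placeholder assumptions for illustration, not taken from this PR's test suite:

```scala
// Sketch only: catalog name, JDBC URL, table and column names are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.catalog.h2",
    "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
  .config("spark.sql.catalog.h2.url", "jdbc:h2:mem:testdb")   // placeholder URL
  .config("spark.sql.catalog.h2.driver", "org.h2.Driver")
  .getOrCreate()

// With this change, the EXTRACT/HOUR calls below can be translated by
// V2ExpressionBuilder and compiled to H2 SQL by H2Dialect, so the predicate is
// evaluated in the database instead of in Spark after a full scan.
val df = spark.sql(
  """SELECT name
    |FROM h2.test.datetime
    |WHERE EXTRACT(YEAR FROM date1) = 2022 AND HOUR(time1) = 9
    |""".stripMargin)

df.explain("formatted")   // pushed-down predicates appear in the scan node
```

The actual translation and SQL generation live in `V2ExpressionBuilder.scala` and `H2Dialect.scala`, and are exercised by the new cases in `JDBCV2Suite.scala`.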

### Why are the changes needed?

So that DS V2 can push datetime functions down to the data source.

### Does this PR introduce _any_ user-facing change?

'No'.
New feature.

### How was this patch tested?

New tests.

Closes #36663 from chenzhx/datetime.

Authored-by: chenzhx 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/connector/expressions/Extract.java   |  62 +
 .../expressions/GeneralScalarExpression.java   |  18 +++
 .../sql/connector/util/V2ExpressionSQLBuilder.java |  11 ++
 .../sql/catalyst/util/V2ExpressionBuilder.scala|  57 +++-
 .../org/apache/spark/sql/jdbc/H2Dialect.scala  |  26 
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala| 146 +
 6 files changed, 296 insertions(+), 24 deletions(-)

diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Extract.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Extract.java
new file mode 100644
index 000..a925f1ee31a
--- /dev/null
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Extract.java
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connector.expressions;
+
+import org.apache.spark.annotation.Evolving;
+
+import java.io.Serializable;
+
+/**
+ * Represent an extract function, which extracts and returns the value of a
+ * specified datetime field from a datetime or interval value expression.
+ * 
+ * The currently supported fields names following the ISO standard:
+ * 
+ *   SECOND Since 3.4.0 
+ *   MINUTE Since 3.4.0 
+ *   HOUR Since 3.4.0 
+ *   MONTH Since 3.4.0 
+ *   QUARTER Since 

[spark] branch master updated: [SPARK-39693][INFRA] Do Not Execute tpcds-1g-gen for Benchmarks Other Than TPCDSQueryBenchmark

2022-07-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 231d3760fe5 [SPARK-39693][INFRA] Do Not Execute tpcds-1g-gen for 
Benchmarks Other Than TPCDSQueryBenchmark
231d3760fe5 is described below

commit 231d3760fe587973e3c1699912015907d6b26766
Author: Kazuyuki Tanimura 
AuthorDate: Fri Jul 8 09:26:35 2022 +0900

[SPARK-39693][INFRA] Do Not Execute tpcds-1g-gen for Benchmarks Other Than 
TPCDSQueryBenchmark

### What changes were proposed in this pull request?
Currently, `tpcds-1g-gen` runs for any benchmark on GitHub Actions, even benchmarks that do not require TPC-DS data.

This PR proposes to skip running `tpcds-1g-gen` if the benchmark class does not contain `TPCDSQueryBenchmark` or `*`, based on the discussion on #37020.

### Why are the changes needed?
This PR should save time when launching benchmarks on GitHub Actions.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested on GitHub Actions.

Closes #37120 from kazuyukitanimura/SPARK-39693.

Authored-by: Kazuyuki Tanimura 
Signed-off-by: Hyukjin Kwon 
---
 .github/workflows/benchmark.yml | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml
index 3170c7c6bb0..4a5fd661c78 100644
--- a/.github/workflows/benchmark.yml
+++ b/.github/workflows/benchmark.yml
@@ -59,6 +59,7 @@ jobs:
   # Any TPC-DS related updates on this job need to be applied to tpcds-1g job 
of build_and_test.yml as well
   tpcds-1g-gen:
 name: "Generate an input dataset for TPCDSQueryBenchmark with SF=1"
+if: contains(github.event.inputs.class, 'TPCDSQueryBenchmark') || 
contains(github.event.inputs.class, '*')
 runs-on: ubuntu-20.04
 env:
   SPARK_LOCAL_IP: localhost
@@ -113,6 +114,7 @@ jobs:
 
   benchmark:
 name: "Run benchmarks: ${{ github.event.inputs.class }} (JDK ${{ 
github.event.inputs.jdk }}, Scala ${{ github.event.inputs.scala }}, ${{ 
matrix.split }} out of ${{ github.event.inputs.num-splits }} splits)"
+if: always()
 needs: [matrix-gen, tpcds-1g-gen]
 # Ubuntu 20.04 is the latest LTS. The next LTS is 22.04.
 runs-on: ubuntu-20.04
@@ -158,6 +160,7 @@ jobs:
   with:
 java-version: ${{ github.event.inputs.jdk }}
 - name: Cache TPC-DS generated data
+  if: contains(github.event.inputs.class, 'TPCDSQueryBenchmark') || 
contains(github.event.inputs.class, '*')
   id: cache-tpcds-sf-1
   uses: actions/cache@v2
   with:





[spark] branch dependabot/maven/org.eclipse.jetty-jetty-server-10.0.10 created (now ddc419dce6e)

2022-07-07 Thread github-bot
This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a change to branch 
dependabot/maven/org.eclipse.jetty-jetty-server-10.0.10
in repository https://gitbox.apache.org/repos/asf/spark.git


  at ddc419dce6e Bump jetty-server from 9.4.46.v20220331 to 10.0.10

No new revisions were added by this update.





[spark] branch dependabot/maven/org.eclipse.jetty-jetty-http-9.4.48.v20220622 created (now 86069eb5d7f)

2022-07-07 Thread github-bot
This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a change to branch 
dependabot/maven/org.eclipse.jetty-jetty-http-9.4.48.v20220622
in repository https://gitbox.apache.org/repos/asf/spark.git


  at 86069eb5d7f Bump jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622

No new revisions were added by this update.





[spark] branch master updated (fe7b8fcd6fe -> 7dcb4bafd02)

2022-07-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from fe7b8fcd6fe [SPARK-33753][CORE] Reduce the memory footprint and gc of 
the cache (hadoopJobMetadata)
 add 7dcb4bafd02 [SPARK-39385][SQL] Translate linear regression aggregate 
functions for pushdown

No new revisions were added by this update.

Summary of changes:
 .../aggregate/GeneralAggregateFunc.java|  4 ++
 .../execution/datasources/DataSourceStrategy.scala | 48 +++--
 .../org/apache/spark/sql/jdbc/H2Dialect.scala  | 12 
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala| 80 +++---
 4 files changed, 109 insertions(+), 35 deletions(-)





[spark] branch branch-3.2 updated (32aff86477a -> c5983c1691f)

2022-07-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


from 32aff86477a [SPARK-39447][SQL][3.2] Avoid AssertionError in 
AdaptiveSparkPlanExec.doExecuteBroadcast
 add c5983c1691f [SPARK-38018][SQL][3.2] Fix ColumnVectorUtils.populate to 
handle CalendarIntervalType correctly

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/execution/vectorized/ColumnVectorUtils.java |  3 ++-
 .../spark/sql/execution/vectorized/ColumnVectorSuite.scala| 11 ++-
 2 files changed, 12 insertions(+), 2 deletions(-)





[spark] branch branch-3.2 updated: [SPARK-39447][SQL][3.2] Avoid AssertionError in AdaptiveSparkPlanExec.doExecuteBroadcast

2022-07-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 32aff86477a [SPARK-39447][SQL][3.2] Avoid AssertionError in 
AdaptiveSparkPlanExec.doExecuteBroadcast
32aff86477a is described below

commit 32aff86477ac001b0ee047db08591d89e90c6eb8
Author: ulysses-you 
AuthorDate: Thu Jul 7 22:49:03 2022 +0800

[SPARK-39447][SQL][3.2] Avoid AssertionError in 
AdaptiveSparkPlanExec.doExecuteBroadcast

This is a backport of https://github.com/apache/spark/pull/36974 for 
branch-3.2

### What changes were proposed in this pull request?

Change `currentPhysicalPlan` to `inputPlan` when we restore the broadcast exchange for DPP.

### Why are the changes needed?

The `currentPhysicalPlan` can be wrapped with a broadcast query stage, so it is not safe to match against it. For example, the broadcast exchange added by DPP may run earlier than the normal broadcast exchange (e.g., the one introduced by a join).

### Does this PR introduce _any_ user-facing change?

Yes, this is a bug fix.

### How was this patch tested?

Added a test.

Closes #37087 from ulysses-you/inputplan-3.2.

Authored-by: ulysses-you 
Signed-off-by: Wenchen Fan 
---
 .../execution/adaptive/AdaptiveSparkPlanExec.scala|  2 +-
 .../spark/sql/DynamicPartitionPruningSuite.scala  | 19 +++
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
index e6c8be1397e..7aeb1c34329 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
@@ -658,7 +658,7 @@ case class AdaptiveSparkPlanExec(
   // node to prevent the loss of the `BroadcastExchangeExec` node in DPP 
subquery.
   // Here, we also need to avoid to insert the `BroadcastExchangeExec` 
node when the newPlan is
   // already the `BroadcastExchangeExec` plan after apply the 
`LogicalQueryStageStrategy` rule.
-  val finalPlan = currentPhysicalPlan match {
+  val finalPlan = inputPlan match {
 case b: BroadcastExchangeLike
   if (!newPlan.isInstanceOf[BroadcastExchangeLike]) => 
b.withNewChildren(Seq(newPlan))
 case _ => newPlan
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala
index 89749e7de00..91176717774 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala
@@ -1597,6 +1597,25 @@ class DynamicPartitionPruningV1SuiteAEOff extends 
DynamicPartitionPruningV1Suite
 class DynamicPartitionPruningV1SuiteAEOn extends DynamicPartitionPruningV1Suite
   with EnableAdaptiveExecutionSuite {
 
+  test("SPARK-39447: Avoid AssertionError in 
AdaptiveSparkPlanExec.doExecuteBroadcast") {
+val df = sql(
+  """
+|WITH empty_result AS (
+|  SELECT * FROM fact_stats WHERE product_id < 0
+|)
+|SELECT *
+|FROM   (SELECT /*+ SHUFFLE_MERGE(fact_sk) */ empty_result.store_id
+|FROM   fact_sk
+|   JOIN empty_result
+| ON fact_sk.product_id = empty_result.product_id) t2
+|   JOIN empty_result
+| ON t2.store_id = empty_result.store_id
+  """.stripMargin)
+
+checkPartitionPruningPredicate(df, false, false)
+checkAnswer(df, Nil)
+  }
+
   test("SPARK-37995: PlanAdaptiveDynamicPruningFilters should use 
prepareExecutedPlan " +
 "rather than createSparkPlan to re-plan subquery") {
 withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true",





[spark] branch master updated (5adcddb87a0 -> fe7b8fcd6fe)

2022-07-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 5adcddb87a0 [SPARK-39695][SQL] Add the `REGEXP_SUBSTR` function
 add fe7b8fcd6fe [SPARK-33753][CORE] Reduce the memory footprint and gc of 
the cache (hadoopJobMetadata)

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/SparkEnv.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)





[spark] branch master updated: [SPARK-39695][SQL] Add the `REGEXP_SUBSTR` function

2022-07-07 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5adcddb87a0 [SPARK-39695][SQL] Add the `REGEXP_SUBSTR` function
5adcddb87a0 is described below

commit 5adcddb87a052ce8e3b3c917c61f019bea5532ae
Author: Max Gekk 
AuthorDate: Thu Jul 7 11:22:41 2022 +0300

[SPARK-39695][SQL] Add the `REGEXP_SUBSTR` function

### What changes were proposed in this pull request?
In the PR, I propose to add a new expression, `RegExpSubStr`, as a runtime-replaceable expression built on `NullIf` and `RegExpExtract`, and to bind it to the function name `REGEXP_SUBSTR`. The `REGEXP_SUBSTR` function returns the substring that matches a regular expression within a string. It takes two parameters:
1. An expression that specifies the string in which the search is to take 
place.
2. An expression that specifies the regular expression string that is the 
pattern for the search.

If the regular expression is not found, the result is **null** (this behaviour is similar to that of other DBMSs). When any of the input parameters is NULL, the function returns NULL too.

For example:
```sql
spark-sql> CREATE TABLE log (logs string);
spark-sql> INSERT INTO log (logs) VALUES
 > ('127.0.0.1 - - [10/Jan/2022:16:55:36 -0800] "GET / HTTP/1.0" 200 2217'),
 > ('192.168.1.99 - - [14/Feb/2022:10:27:10 -0800] "GET /cgi-bin/try/ HTTP/1.0" 200 3396');
spark-sql> SELECT REGEXP_SUBSTR(logs,'\\b\\d{1,3}\.\\d{1,3}\.\\d{1,3}\.\\d{1,3}\\b') AS IP, REGEXP_SUBSTR(logs,'([\\w:\/]+\\s[+\-]\\d{4})') AS DATE FROM log;
127.0.0.1       10/Jan/2022:16:55:36 -0800
192.168.1.99    14/Feb/2022:10:27:10 -0800
```
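
To make the NULL semantics above concrete, here is a small Scala sketch (assuming a build that includes this commit; the expected results in the comments restate the behaviour described above rather than captured output):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Expected: the matching substring, e.g. "Steven".
spark.sql("SELECT regexp_substr('Steven Jones and Stephen Smith', 'Ste(v|ph)en')").show()

// Expected: NULL, because the pattern does not match anywhere.
spark.sql("SELECT regexp_substr('no digits here', '[0-9]+')").show()

// Expected: NULL, because one of the inputs is NULL.
spark.sql("SELECT regexp_substr(CAST(NULL AS STRING), '[0-9]+')").show()
```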

### Why are the changes needed?
To make the migration process from other systems to Spark SQL easier, and 
achieve feature parity to such systems. For example, the systems below support 
the `REGEXP_SUBSTR` function, see:
- Oracle: 
https://docs.oracle.com/cd/B12037_01/server.101/b10759/functions116.htm
- DB2: https://www.ibm.com/docs/en/db2/11.5?topic=functions-regexp-substr
- Snowflake: 
https://docs.snowflake.com/en/sql-reference/functions/regexp_substr.html
- BigQuery: 
https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_substr
- Redshift: 
https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_SUBSTR.html
- MariaDB: https://mariadb.com/kb/en/regexp_substr/
- Exasol DB: 
https://docs.exasol.com/db/latest/sql_references/functions/alphabeticallistfunctions/regexp_substr.htm

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running new tests:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z 
regexp-functions.sql"
$ build/sbt "sql/testOnly *ExpressionsSchemaSuite"
$ build/sbt "sql/test:testOnly 
org.apache.spark.sql.expressions.ExpressionInfoSuite"
```

Closes #37101 from MaxGekk/regexp_substr.

Authored-by: Max Gekk 
Signed-off-by: Max Gekk 
---
 .../sql/catalyst/analysis/FunctionRegistry.scala   |  1 +
 .../catalyst/expressions/regexpExpressions.scala   | 39 +++
 .../sql-functions/sql-expression-schema.md |  1 +
 .../sql-tests/inputs/regexp-functions.sql  |  9 
 .../sql-tests/results/regexp-functions.sql.out | 56 ++
 5 files changed, 106 insertions(+)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
index 52d84cfa175..20c719aec68 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
@@ -585,6 +585,7 @@ object FunctionRegistry {
 expression[XPathShort]("xpath_short"),
 expression[XPathString]("xpath_string"),
 expression[RegExpCount]("regexp_count"),
+expression[RegExpSubStr]("regexp_substr"),
 
 // datetime functions
 expression[AddMonths]("add_months"),
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
index 8d813058296..b240e849f4d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
@@ -1004,3 +1004,42 @@ case class RegExpCount(left: Expression, right: 
Expression)
   newChildren: IndexedSeq[Expression]): RegExpCount =
 copy(left = newChildren(0), right = newChildren(1))
 }
+
+// 

[spark] branch master updated: [SPARK-39689][SQL] Support 2-chars `lineSep` in CSV datasource

2022-07-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new bb4c4778713 [SPARK-39689][SQL] Support 2-chars `lineSep` in CSV 
datasource
bb4c4778713 is described below

commit bb4c4778713c7ba1ee92d0bb0763d7d3ce54374f
Author: yaohua 
AuthorDate: Thu Jul 7 15:22:06 2022 +0900

[SPARK-39689][SQL] Support 2-chars `lineSep` in CSV datasource

### What changes were proposed in this pull request?
The Univocity parser allows setting the line separator to 1 or 2 characters ([code](https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/Format.java#L103)), so the CSV options should not block this usage ([code](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala#L218)). This PR updates the requirement so that `lineSep` accepts 1 or 2 characters in `CSVOptions`.

Due to the limitation of supporting a multi-character `lineSep` within quotes, I propose to leave this feature undocumented and add a WARN message for users.
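
A minimal usage sketch (the file path is a placeholder; assumes a build with this change):

```scala
// Read a CSV file written with Windows CRLF line endings using an explicit
// 2-character lineSep. Before this change the option was rejected; now it is
// accepted and only logs a warning. The path below is a placeholder.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val df = spark.read
  .option("header", "true")
  .option("lineSep", "\r\n")
  .csv("/tmp/data_crlf.csv")

df.show()
```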

### Why are the changes needed?
Unblocks the use of a 2-character `lineSep`.

### Does this PR introduce _any_ user-facing change?
No - undocumented feature.

### How was this patch tested?
New UT.

Closes #37107 from Yaohua628/spark-39689.

Lead-authored-by: yaohua 
Co-authored-by: Yaohua Zhao <79476540+yaohua...@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon 
---
 .../apache/spark/sql/catalyst/csv/CSVOptions.scala |  6 +++-
 .../sql/execution/datasources/csv/CSVSuite.scala   | 35 ++
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
index 9daa50ba5a4..3e92c3d25eb 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
@@ -215,7 +215,11 @@ class CSVOptions(
*/
   val lineSeparator: Option[String] = parameters.get("lineSep").map { sep =>
 require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
-require(sep.length == 1, "'lineSep' can contain only 1 character.")
+// Intentionally allow it up to 2 for Window's CRLF although multiple
+// characters have an issue with quotes. This is intentionally 
undocumented.
+require(sep.length <= 2, "'lineSep' can contain only 1 character.")
+if (sep.length == 2) logWarning("It is not recommended to set 'lineSep' " +
+  "with 2 characters due to the limitation of supporting multi-char 
'lineSep' within quotes.")
 sep
   }
 
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index 62dccaad1dd..bf92ffcf465 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
@@ -34,6 +34,7 @@ import org.apache.commons.lang3.exception.ExceptionUtils
 import org.apache.commons.lang3.time.FastDateFormat
 import org.apache.hadoop.io.SequenceFile.CompressionType
 import org.apache.hadoop.io.compress.GzipCodec
+import org.apache.logging.log4j.Level
 
 import org.apache.spark.{SparkConf, SparkException, TestUtils}
 import org.apache.spark.sql.{AnalysisException, Column, DataFrame, Encoders, 
QueryTest, Row}
@@ -2296,6 +2297,40 @@ abstract class CSVSuite
 assert(errMsg2.contains("'lineSep' can contain only 1 character"))
   }
 
+  Seq(true, false).foreach { multiLine =>
+test(s"""lineSep with 2 chars when multiLine set to $multiLine""") {
+  Seq("\r\n", "||", "|").foreach { newLine =>
+val logAppender = new LogAppender("lineSep WARN logger")
+withTempDir { dir =>
+  val inputData = if (multiLine) {
+s"""name,"i am the${newLine} 
column1"${newLine}jack,30${newLine}tom,18"""
+  } else {
+s"name,age${newLine}jack,30${newLine}tom,18"
+  }
+  Files.write(new File(dir, "/data.csv").toPath, inputData.getBytes())
+  withLogAppender(logAppender) {
+val df = spark.read
+  .options(
+Map("header" -> "true", "multiLine" -> multiLine.toString, 
"lineSep" -> newLine))
+  .csv(dir.getCanonicalPath)
+// Due to the limitation of Univocity parser:
+// multiple chars of newlines cannot be properly handled when they 
exist within quotes.
+// Leave 2-char lineSep as an undocumented features and logWarn 
user
+if (newLine 

[spark] branch master updated: [SPARK-39703][CORE][BUILD] Mima complains with Scala 2.13 for the changes in DeployMessages

2022-07-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 845950b72b6 [SPARK-39703][CORE][BUILD] Mima complains with Scala 2.13 
for the changes in DeployMessages
845950b72b6 is described below

commit 845950b72b63f94b03436a598d9d041e662a0b53
Author: Hyukjin Kwon 
AuthorDate: Thu Jul 7 15:21:25 2022 +0900

[SPARK-39703][CORE][BUILD] Mima complains with Scala 2.13 for the changes 
in DeployMessages

### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/36716. Mima with Scala 2.13 complains about the changes in `DeployMessages` for some reason:

```
[error] spark-core: Failed binary compatibility check against 
org.apache.spark:spark-core_2.13:3.2.0! Found 6 potential problems (filtered 
933)
[error]  * the type hierarchy of object 
org.apache.spark.deploy.DeployMessages#LaunchExecutor is different in current 
version. Missing types {scala.runtime.AbstractFunction7}
[error]filter with: 
ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.deploy.DeployMessages$LaunchExecutor$")
[error]  * method requestedTotal()Int in class 
org.apache.spark.deploy.DeployMessages#RequestExecutors does not have a 
correspondent in current version
[error]filter with: 
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.requestedTotal")
[error]  * method 
copy(java.lang.String,Int)org.apache.spark.deploy.DeployMessages#RequestExecutors
 in class org.apache.spark.deploy.DeployMessages#RequestExecutors's type is 
different in current version, where it is 
(java.lang.String,scala.collection.immutable.Map)org.apache.spark.deploy.DeployMessages#RequestExecutors
 instead of 
(java.lang.String,Int)org.apache.spark.deploy.DeployMessages#RequestExecutors
[error]filter with: 
ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.copy")
[error]  * synthetic method copy$default$2()Int in class 
org.apache.spark.deploy.DeployMessages#RequestExecutors has a different result 
type in current version, where it is scala.collection.immutable.Map rather than 
Int
[error]filter with: 
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.copy$default$2")
[error]  * method this(java.lang.String,Int)Unit in class 
org.apache.spark.deploy.DeployMessages#RequestExecutors's type is different in 
current version, where it is 
(java.lang.String,scala.collection.immutable.Map)Unit instead of 
(java.lang.String,Int)Unit
[error]filter with: 
ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.this")
[error]  * method 
apply(java.lang.String,Int)org.apache.spark.deploy.DeployMessages#RequestExecutors
 in object org.apache.spark.deploy.DeployMessages#RequestExecutors in current 
version does not have a correspondent with same parameter signature among 
(java.lang.String,scala.collection.immutable.Map)org.apache.spark.deploy.DeployMessages#RequestExecutors,
 (java.lang.Object,java.lang.Object)java.lang.Object
[error]filter with: 
ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.apply")
```

https://github.com/apache/spark/runs/7221231391?check_suite_focus=true

This PR adds the suggested filters.
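
Collected from the `filter with:` suggestions in the log above, the additions would look roughly like the following sketch (the `Seq` name is illustrative; the authoritative change is the `project/MimaExcludes.scala` diff below, which is truncated in this message):

```scala
import com.typesafe.tools.mima.core._

// Exclusions corresponding to the six false positives reported above.
lazy val deployMessagesExcludes = Seq(
  ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.deploy.DeployMessages$LaunchExecutor$"),
  ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.requestedTotal"),
  ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.copy"),
  ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.copy$default$2"),
  ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.this"),
  ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.apply")
)
```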

### Why are the changes needed?

To make the scheduled build (Scala 2.13) pass in 
https://github.com/apache/spark/actions/workflows/build_scala213.yml

### Does this PR introduce _any_ user-facing change?

No, dev-only. The alarms are false positives.

### How was this patch tested?

CI should verify this.

Closes #37109 from HyukjinKwon/SPARK-39703.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 project/MimaExcludes.scala | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala
index fb71155657f..3f3d8575477 100644
--- a/project/MimaExcludes.scala
+++ b/project/MimaExcludes.scala
@@ -54,7 +54,15 @@ object MimaExcludes {
 
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.classification.Classifier.getNumClasses"),
 
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.classification.Classifier.getNumClasses$default$2"),
 
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.classification.OneVsRest.extractInstances"),
-
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.classification.OneVsRestModel.extractInstances")
+

[spark] branch master updated: [SPARK-39679][SQL] TakeOrderedAndProjectExec should respect child output ordering

2022-07-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 427fbee4c00 [SPARK-39679][SQL] TakeOrderedAndProjectExec should 
respect child output ordering
427fbee4c00 is described below

commit 427fbee4c009d8d49fdb80a2e2532723eff84150
Author: ulysses-you 
AuthorDate: Thu Jul 7 14:20:29 2022 +0800

[SPARK-39679][SQL] TakeOrderedAndProjectExec should respect child output 
ordering

### What changes were proposed in this pull request?

Skip the local sort in `TakeOrderedAndProjectExec` if the child output ordering already satisfies the required ordering.

### Why are the changes needed?

`TakeOrderedAndProjectExec` should respect the child output ordering to avoid an unnecessary sort, for example when `TakeOrderedAndProjectExec` sits on top of a `SortMergeJoin`:
```SQL
SELECT * FROM t1 JOIN t2 ON t1.c1 = t2.c2 ORDER BY t1.c1 LIMIT 100;
```

### Does this PR introduce _any_ user-facing change?

No, it only improves performance.

### How was this patch tested?

Added a benchmark test:
```scala
val row = 10 * 1000
val df1 = spark.range(0, row, 1, 2).selectExpr("id % 3 as c1")
val df2 = spark.range(0, row, 1, 2).selectExpr("id % 3 as c2")
df1.join(df2, col("c1") === col("c2"))
  .orderBy(col("c1"))
  .limit(100)
  .noop()
```

Before:
```
TakeOrderedAndProject

OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
TakeOrderedAndProject with SMJ:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------
TakeOrderedAndProject with SMJ for doExecute                 3356           3414          61         0.0      335569.5       1.0X
TakeOrderedAndProject with SMJ for executeCollect            3331           3370          47         0.0      333118.0       1.0X

OpenJDK 64-Bit Server VM 11.0.15+10-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
TakeOrderedAndProject with SMJ:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------
TakeOrderedAndProject with SMJ for doExecute                 3745           3766          24         0.0      374477.3       1.0X
TakeOrderedAndProject with SMJ for executeCollect            3657           3680          38         0.0      365703.4       1.0X

OpenJDK 64-Bit Server VM 17.0.3+7-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
TakeOrderedAndProject with SMJ:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------
TakeOrderedAndProject with SMJ for doExecute                 2499           2554          47         0.0      249945.5       1.0X
TakeOrderedAndProject with SMJ for executeCollect            2510           2515           8         0.0      250956.9       1.0X
```

After:
```
TakeOrderedAndProject

OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
TakeOrderedAndProject with SMJ:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------
TakeOrderedAndProject with SMJ for doExecute                  287            337          43         0.0       28734.9       1.0X
TakeOrderedAndProject with SMJ for executeCollect             150            170          30         0.1       15037.8       1.9X

OpenJDK 64-Bit Server VM 11.0.15+10-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
TakeOrderedAndProject with SMJ:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative