[spark] branch master updated: [SPARK-39689][SQL] Support 2-chars `lineSep` in CSV datasource

2022-07-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new bb4c4778713 [SPARK-39689][SQL] Support 2-chars `lineSep` in CSV datasource
bb4c4778713 is described below

commit bb4c4778713c7ba1ee92d0bb0763d7d3ce54374f
Author: yaohua 
AuthorDate: Thu Jul 7 15:22:06 2022 +0900

[SPARK-39689][SQL] Support 2-chars `lineSep` in CSV datasource

### What changes were proposed in this pull request?
Univocity parser allows setting the line separator to 1 or 2 characters ([code](https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/Format.java#L103)), so the CSV options should not block this usage ([code](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala#L218)). This PR relaxes the requirement on `lineSep` in `CSVOptions` to accept 1 or 2 characters.

Due to the limitation that a multi-char `lineSep` is not handled within quotes, I propose to leave this feature undocumented and log a WARN message for users.
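For illustration, a minimal usage sketch (not documented behavior; the file path and data are hypothetical):

```scala
// Reading a CSV file whose records are separated by Windows-style CRLF.
// A 2-character lineSep is accepted after this change, but a warning is logged
// because multi-char separators are not handled inside quoted values.
val df = spark.read
  .option("header", "true")
  .option("lineSep", "\r\n")
  .csv("/tmp/data.csv")   // hypothetical path
```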

### Why are the changes needed?
Unblocks the use of a 2-character `lineSep`.

### Does this PR introduce _any_ user-facing change?
No - undocumented feature.

### How was this patch tested?
New UT.

Closes #37107 from Yaohua628/spark-39689.

Lead-authored-by: yaohua 
Co-authored-by: Yaohua Zhao <79476540+yaohua...@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon 
---
 .../apache/spark/sql/catalyst/csv/CSVOptions.scala |  6 +++-
 .../sql/execution/datasources/csv/CSVSuite.scala   | 35 ++
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
index 9daa50ba5a4..3e92c3d25eb 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
@@ -215,7 +215,11 @@ class CSVOptions(
*/
   val lineSeparator: Option[String] = parameters.get("lineSep").map { sep =>
 require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
-require(sep.length == 1, "'lineSep' can contain only 1 character.")
+// Intentionally allow it up to 2 for Window's CRLF although multiple
+// characters have an issue with quotes. This is intentionally undocumented.
+require(sep.length <= 2, "'lineSep' can contain only 1 character.")
+if (sep.length == 2) logWarning("It is not recommended to set 'lineSep' " +
+  "with 2 characters due to the limitation of supporting multi-char 'lineSep' within quotes.")
 sep
   }
 
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index 62dccaad1dd..bf92ffcf465 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
@@ -34,6 +34,7 @@ import org.apache.commons.lang3.exception.ExceptionUtils
 import org.apache.commons.lang3.time.FastDateFormat
 import org.apache.hadoop.io.SequenceFile.CompressionType
 import org.apache.hadoop.io.compress.GzipCodec
+import org.apache.logging.log4j.Level
 
 import org.apache.spark.{SparkConf, SparkException, TestUtils}
 import org.apache.spark.sql.{AnalysisException, Column, DataFrame, Encoders, QueryTest, Row}
@@ -2296,6 +2297,40 @@ abstract class CSVSuite
 assert(errMsg2.contains("'lineSep' can contain only 1 character"))
   }
 
+  Seq(true, false).foreach { multiLine =>
+test(s"""lineSep with 2 chars when multiLine set to $multiLine""") {
+  Seq("\r\n", "||", "|").foreach { newLine =>
+val logAppender = new LogAppender("lineSep WARN logger")
+withTempDir { dir =>
+  val inputData = if (multiLine) {
+s"""name,"i am the${newLine} column1"${newLine}jack,30${newLine}tom,18"""
+  } else {
+s"name,age${newLine}jack,30${newLine}tom,18"
+  }
+  Files.write(new File(dir, "/data.csv").toPath, inputData.getBytes())
+  withLogAppender(logAppender) {
+val df = spark.read
+  .options(
+Map("header" -> "true", "multiLine" -> multiLine.toString, "lineSep" -> newLine))
+  .csv(dir.getCanonicalPath)
+// Due to the limitation of Univocity parser:
+// multiple chars of newlines cannot be properly handled when they exist within quotes.
+// Leave 2-char lineSep as an undocumented features and logWarn user
+if (newLine !=

[spark] branch master updated: [SPARK-39703][CORE][BUILD] Mima complains with Scala 2.13 for the changes in DeployMessages

2022-07-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 845950b72b6 [SPARK-39703][CORE][BUILD] Mima complains with Scala 2.13 for the changes in DeployMessages
845950b72b6 is described below

commit 845950b72b63f94b03436a598d9d041e662a0b53
Author: Hyukjin Kwon 
AuthorDate: Thu Jul 7 15:21:25 2022 +0900

[SPARK-39703][CORE][BUILD] Mima complains with Scala 2.13 for the changes in DeployMessages

### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/36716. Mima with Scala 2.13 complains about the changes in `DeployMessages` for some reason:

```
[error] spark-core: Failed binary compatibility check against org.apache.spark:spark-core_2.13:3.2.0! Found 6 potential problems (filtered 933)
[error]  * the type hierarchy of object org.apache.spark.deploy.DeployMessages#LaunchExecutor is different in current version. Missing types {scala.runtime.AbstractFunction7}
[error]    filter with: ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.deploy.DeployMessages$LaunchExecutor$")
[error]  * method requestedTotal()Int in class org.apache.spark.deploy.DeployMessages#RequestExecutors does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.requestedTotal")
[error]  * method copy(java.lang.String,Int)org.apache.spark.deploy.DeployMessages#RequestExecutors in class org.apache.spark.deploy.DeployMessages#RequestExecutors's type is different in current version, where it is (java.lang.String,scala.collection.immutable.Map)org.apache.spark.deploy.DeployMessages#RequestExecutors instead of (java.lang.String,Int)org.apache.spark.deploy.DeployMessages#RequestExecutors
[error]    filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.copy")
[error]  * synthetic method copy$default$2()Int in class org.apache.spark.deploy.DeployMessages#RequestExecutors has a different result type in current version, where it is scala.collection.immutable.Map rather than Int
[error]    filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.copy$default$2")
[error]  * method this(java.lang.String,Int)Unit in class org.apache.spark.deploy.DeployMessages#RequestExecutors's type is different in current version, where it is (java.lang.String,scala.collection.immutable.Map)Unit instead of (java.lang.String,Int)Unit
[error]    filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.this")
[error]  * method apply(java.lang.String,Int)org.apache.spark.deploy.DeployMessages#RequestExecutors in object org.apache.spark.deploy.DeployMessages#RequestExecutors in current version does not have a correspondent with same parameter signature among (java.lang.String,scala.collection.immutable.Map)org.apache.spark.deploy.DeployMessages#RequestExecutors, (java.lang.Object,java.lang.Object)java.lang.Object
[error]    filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.apply")
```

https://github.com/apache/spark/runs/7221231391?check_suite_focus=true

This PR adds the suggested filters.
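For context, the suggested filters become entries of roughly this shape in `project/MimaExcludes.scala` (a sketch based on the filter strings printed above; the committed list is in the diff below):

```scala
import com.typesafe.tools.mima.core._

// Sketch of the shape of the new exclusions; the committed list may differ slightly.
Seq(
  ProblemFilters.exclude[MissingTypesProblem](
    "org.apache.spark.deploy.DeployMessages$LaunchExecutor$"),
  ProblemFilters.exclude[DirectMissingMethodProblem](
    "org.apache.spark.deploy.DeployMessages#RequestExecutors.requestedTotal"),
  ProblemFilters.exclude[IncompatibleMethTypeProblem](
    "org.apache.spark.deploy.DeployMessages#RequestExecutors.copy")
)
```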

### Why are the changes needed?

To make the scheduled build (Scala 2.13) pass in https://github.com/apache/spark/actions/workflows/build_scala213.yml

### Does this PR introduce _any_ user-facing change?

No, dev-only. The alarms are false positives.

### How was this patch tested?

CI should verify this.

Closes #37109 from HyukjinKwon/SPARK-39703.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 project/MimaExcludes.scala | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala
index fb71155657f..3f3d8575477 100644
--- a/project/MimaExcludes.scala
+++ b/project/MimaExcludes.scala
@@ -54,7 +54,15 @@ object MimaExcludes {
 ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.classification.Classifier.getNumClasses"),
 ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.classification.Classifier.getNumClasses$default$2"),
 ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.classification.OneVsRest.extractInstances"),
-ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.classification.OneVsRestModel.extractInstances")
+ProblemFilters.exclude[Dir

[spark] branch master updated: [SPARK-39679][SQL] TakeOrderedAndProjectExec should respect child output ordering

2022-07-06 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 427fbee4c00 [SPARK-39679][SQL] TakeOrderedAndProjectExec should respect child output ordering
427fbee4c00 is described below

commit 427fbee4c009d8d49fdb80a2e2532723eff84150
Author: ulysses-you 
AuthorDate: Thu Jul 7 14:20:29 2022 +0800

[SPARK-39679][SQL] TakeOrderedAndProjectExec should respect child output ordering

### What changes were proposed in this pull request?

Skip the local sort in `TakeOrderedAndProjectExec` if the child's output ordering already satisfies the required ordering.
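A minimal sketch of the idea (assuming the existing `sortOrder` and `child` fields of `TakeOrderedAndProjectExec`; not the literal patch):

```scala
import org.apache.spark.sql.catalyst.expressions.SortOrder

// Only add a local sort if the child's output ordering does not already
// satisfy the requested sort order.
val needsLocalSort = !SortOrder.orderingSatisfies(child.outputOrdering, sortOrder)
```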

### Why are the changes needed?

TakeOrderedAndProjectExec should respect the child's output ordering to avoid an unnecessary sort.
For example: TakeOrderedAndProjectExec on top of a SortMergeJoin.
```SQL
SELECT * FROM t1 JOIN t2 ON t1.c1 = t2.c2 ORDER BY t1.c1 LIMIT 100;
```

### Does this PR introduce _any_ user-facing change?

No, this only improves performance.

### How was this patch tested?

Added a benchmark test:
```scala
val row = 10 * 1000
val df1 = spark.range(0, row, 1, 2).selectExpr("id % 3 as c1")
val df2 = spark.range(0, row, 1, 2).selectExpr("id % 3 as c2")
df1.join(df2, col("c1") === col("c2"))
  .orderBy(col("c1"))
  .limit(100)
  .noop()
```

Before:
```
================================================================================================
TakeOrderedAndProject
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
TakeOrderedAndProject with SMJ:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------
TakeOrderedAndProject with SMJ for doExecute                 3356           3414          61         0.0      335569.5       1.0X
TakeOrderedAndProject with SMJ for executeCollect            3331           3370          47         0.0      333118.0       1.0X

OpenJDK 64-Bit Server VM 11.0.15+10-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
TakeOrderedAndProject with SMJ:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------
TakeOrderedAndProject with SMJ for doExecute                 3745           3766          24         0.0      374477.3       1.0X
TakeOrderedAndProject with SMJ for executeCollect            3657           3680          38         0.0      365703.4       1.0X

OpenJDK 64-Bit Server VM 17.0.3+7-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
TakeOrderedAndProject with SMJ:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------
TakeOrderedAndProject with SMJ for doExecute                 2499           2554          47         0.0      249945.5       1.0X
TakeOrderedAndProject with SMJ for executeCollect            2510           2515           8         0.0      250956.9       1.0X
```

After:
```
================================================================================================
TakeOrderedAndProject
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
TakeOrderedAndProject with SMJ:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------
TakeOrderedAndProject with SMJ for doExecute                  287            337          43         0.0       28734.9       1.0X
TakeOrderedAndProject with SMJ for executeCollect             150            170          30         0.1       15037.8       1.9X

OpenJDK 64-Bit Server VM 11.0.15+10-LTS on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
TakeOrderedAndProject with SMJ:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------

[spark] branch branch-3.2 updated (1c0bd4c15a2 -> be891ad9908)

2022-07-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


 from 1c0bd4c15a2 [SPARK-39656][SQL][3.2] Fix wrong namespace in DescribeNamespaceExec
 add be891ad9908 [SPARK-39551][SQL][3.2] Add AQE invalid plan check

No new revisions were added by this update.

Summary of changes:
 .../execution/adaptive/AdaptiveSparkPlanExec.scala | 72 --
 .../adaptive/InvalidAQEPlanException.scala | 17 +++--
 .../sql/execution/adaptive/ValidateSparkPlan.scala | 68 
 .../adaptive/AdaptiveQueryExecSuite.scala  | 25 +++-
 4 files changed, 141 insertions(+), 41 deletions(-)
 copy core/src/main/scala/org/apache/spark/BarrierTaskInfo.scala => sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/InvalidAQEPlanException.scala (61%)
 create mode 100644 sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ValidateSparkPlan.scala





[spark] branch master updated: [SPARK-39503][SQL][FOLLOWUP] Fix ansi golden files and typo

2022-07-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 88b983d9f2a [SPARK-39503][SQL][FOLLOWUP] Fix ansi golden files and typo
88b983d9f2a is described below

commit 88b983d9f2a7190b8d74a6176740afb65fa08223
Author: ulysses-you 
AuthorDate: Thu Jul 7 13:17:11 2022 +0900

[SPARK-39503][SQL][FOLLOWUP] Fix ansi golden files and typo

### What changes were proposed in this pull request?

- re-generate ansi golden files
- fix FunctionIdentifier parameter name typo
### Why are the changes needed?

Fix ansi golden files and typo
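For reference, the renamed parameter only affects the auxiliary constructor's signature in the diff below; usage stays the same (a sketch with hypothetical names):

```scala
import org.apache.spark.sql.catalyst.FunctionIdentifier

// Two-argument auxiliary constructor: function name plus optional database.
val fid = new FunctionIdentifier("my_func", Some("my_db"))
```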

### Does this PR introduce _any_ user-facing change?

No, not released yet.

### How was this patch tested?

pass CI

Closes #37111 from ulysses-you/catalog-followup.

Authored-by: ulysses-you 
Signed-off-by: Hyukjin Kwon 
---
 .../apache/spark/sql/catalyst/identifiers.scala|  2 +-
 .../approved-plans-v1_4/q83.ansi/explain.txt   | 28 +++---
 .../approved-plans-v1_4/q83.ansi/simplified.txt| 14 +--
 .../approved-plans-v1_4/q83.sf100.ansi/explain.txt | 28 +++---
 .../q83.sf100.ansi/simplified.txt  | 14 +--
 5 files changed, 43 insertions(+), 43 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/identifiers.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/identifiers.scala
index 9cae2b622a7..2de44d6f349 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/identifiers.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/identifiers.scala
@@ -142,7 +142,7 @@ case class FunctionIdentifier(funcName: String, database: Option[String], catalo
   override val identifier: String = funcName
 
   def this(funcName: String) = this(funcName, None, None)
-  def this(table: String, database: Option[String]) = this(table, database, None)
+  def this(funcName: String, database: Option[String]) = this(funcName, database, None)
 
   override def toString: String = unquotedString
 }
diff --git a/sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q83.ansi/explain.txt b/sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q83.ansi/explain.txt
index d281e59c727..905d29293a3 100644
--- a/sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q83.ansi/explain.txt
+++ b/sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q83.ansi/explain.txt
@@ -13,11 +13,11 @@ TakeOrderedAndProject (46)
   : :  :  +- * BroadcastHashJoin Inner BuildRight (8)
   : :  : :- * Filter (3)
   : :  : :  +- * ColumnarToRow (2)
-  : :  : : +- Scan parquet default.store_returns (1)
+  : :  : : +- Scan parquet spark_catalog.default.store_returns (1)
   : :  : +- BroadcastExchange (7)
   : :  :+- * Filter (6)
   : :  :   +- * ColumnarToRow (5)
-  : :  :  +- Scan parquet default.item (4)
+  : :  :  +- Scan parquet spark_catalog.default.item (4)
   : :  +- ReusedExchange (10)
   : +- BroadcastExchange (28)
   :+- * HashAggregate (27)
@@ -29,7 +29,7 @@ TakeOrderedAndProject (46)
   :   :  +- * BroadcastHashJoin Inner BuildRight (20)
   :   : :- * Filter (18)
   :   : :  +- * ColumnarToRow (17)
-  :   : : +- Scan parquet default.catalog_returns (16)
+  :   : : +- Scan parquet spark_catalog.default.catalog_returns (16)
   :   : +- ReusedExchange (19)
   :   +- ReusedExchange (22)
   +- BroadcastExchange (43)
@@ -42,12 +42,12 @@ TakeOrderedAndProject (46)
 :  +- * BroadcastHashJoin Inner BuildRight (35)
 : :- * Filter (33)
 : :  +- * ColumnarToRow (32)
-: : +- Scan parquet default.web_returns (31)
+: : +- Scan parquet spark_catalog.default.web_returns (31)
 : +- ReusedExchange (34)
 +- ReusedExchange (37)
 
 
-(1) Scan parquet default.store_returns
+(1) Scan parquet spark_catalog.default.store_returns
 Output [3]: [sr_item_sk#1, sr_return_quantity#2, sr_returned_date_sk#3]
 Batched: true
 Location: InMemoryFileIndex []
@@ -62,7 +62,7 @@ Input [3]: [sr_item_sk#1, sr_return_quantity#2, sr_returned_date_sk#3]
 Input [3]: [sr_item_sk#1, sr_return_qu

[spark] branch branch-3.3 updated: [SPARK-37527][SQL][FOLLOWUP] Cannot compile COVAR_POP, COVAR_SAMP and CORR in `H2Dialect` if them with `DISTINCT`

2022-07-06 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 016dfeb760d [SPARK-37527][SQL][FOLLOWUP] Cannot compile COVAR_POP, COVAR_SAMP and CORR in `H2Dialect` if them with `DISTINCT`
016dfeb760d is described below

commit 016dfeb760dbe1109e3c81c39bcd1bf3316a3e20
Author: Jiaan Geng 
AuthorDate: Thu Jul 7 09:55:45 2022 +0800

[SPARK-37527][SQL][FOLLOWUP] Cannot compile COVAR_POP, COVAR_SAMP and CORR in `H2Dialect` if them with `DISTINCT`

https://github.com/apache/spark/pull/35145 compiles COVAR_POP, COVAR_SAMP and CORR in H2Dialect.
Because H2 doesn't support COVAR_POP, COVAR_SAMP and CORR with DISTINCT, that PR introduced a bug: these functions are still compiled when the aggregates are used with DISTINCT.

Fix the bug so that COVAR_POP, COVAR_SAMP and CORR are not compiled when these aggregate functions are used with DISTINCT.

'Yes'.
The bug will be fixed.

New test cases.

Closes #37090 from beliefer/SPARK-37527_followup2.

Authored-by: Jiaan Geng 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 14f2bae208c093dea58e3f947fb660e8345fb256)
Signed-off-by: Wenchen Fan 
---
 .../org/apache/spark/sql/jdbc/H2Dialect.scala  | 15 -
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala| 38 +++---
 2 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala
index 4a88203ec59..967df112af2 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala
@@ -55,18 +55,15 @@ private object H2Dialect extends JdbcDialect {
   assert(f.children().length == 1)
   val distinct = if (f.isDistinct) "DISTINCT " else ""
   Some(s"STDDEV_SAMP($distinct${f.children().head})")
-case f: GeneralAggregateFunc if f.name() == "COVAR_POP" =>
+case f: GeneralAggregateFunc if f.name() == "COVAR_POP" && !f.isDistinct =>
   assert(f.children().length == 2)
-  val distinct = if (f.isDistinct) "DISTINCT " else ""
-  Some(s"COVAR_POP($distinct${f.children().head}, ${f.children().last})")
-case f: GeneralAggregateFunc if f.name() == "COVAR_SAMP" =>
+  Some(s"COVAR_POP(${f.children().head}, ${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "COVAR_SAMP" && !f.isDistinct =>
   assert(f.children().length == 2)
-  val distinct = if (f.isDistinct) "DISTINCT " else ""
-  Some(s"COVAR_SAMP($distinct${f.children().head}, ${f.children().last})")
-case f: GeneralAggregateFunc if f.name() == "CORR" =>
+  Some(s"COVAR_SAMP(${f.children().head}, ${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "CORR" && !f.isDistinct =>
   assert(f.children().length == 2)
-  val distinct = if (f.isDistinct) "DISTINCT " else ""
-  Some(s"CORR($distinct${f.children().head}, ${f.children().last})")
+  Some(s"CORR(${f.children().head}, ${f.children().last})")
 case _ => None
   }
 )
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala b/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala
index 2f94f9ef31e..293334084af 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala
@@ -1028,23 +1028,37 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel
   }
 
   test("scan with aggregate push-down: COVAR_POP COVAR_SAMP with filter and group by") {
-val df = sql("select COVAR_POP(bonus, bonus), COVAR_SAMP(bonus, bonus)" +
-  " FROM h2.test.employee where dept > 0 group by DePt")
-checkFiltersRemoved(df)
-checkAggregateRemoved(df)
-checkPushedInfo(df, "PushedAggregates: [COVAR_POP(BONUS, BONUS), COVAR_SAMP(BONUS, BONUS)], " +
+val df1 = sql("SELECT COVAR_POP(bonus, bonus), COVAR_SAMP(bonus, bonus)" +
+  " FROM h2.test.employee WHERE dept > 0 GROUP BY DePt")
+checkFiltersRemoved(df1)
+checkAggregateRemoved(df1)
+checkPushedInfo(df1, "PushedAggregates: [COVAR_POP(BONUS, BONUS), COVAR_SAMP(BONUS, BONUS)], " +
   "PushedFilters: [DEPT IS NOT NULL, DEPT > 0], PushedGroupByExpressions: [DEPT]")
-checkAnswer(df, Seq(Row(1d, 2d), Row(2500d, 5000d), Row(0d, null)))
+checkAnswer(df1, Seq(Row(1d, 2d), Row(2500d, 5000d), Row(0d, null)))
+
+val df2 = sql("SELECT COVAR_POP(DISTINCT bonus, bonus), COVAR_SAMP(DISTINCT bonus, bonus)" +
+  " FROM h2.test.employee WHERE dept > 0 

[spark] branch master updated: [SPARK-37527][SQL][FOLLOWUP] Cannot compile COVAR_POP, COVAR_SAMP and CORR in `H2Dialect` if them with `DISTINCT`

2022-07-06 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 14f2bae208c [SPARK-37527][SQL][FOLLOWUP] Cannot compile COVAR_POP, COVAR_SAMP and CORR in `H2Dialect` if them with `DISTINCT`
14f2bae208c is described below

commit 14f2bae208c093dea58e3f947fb660e8345fb256
Author: Jiaan Geng 
AuthorDate: Thu Jul 7 09:55:45 2022 +0800

[SPARK-37527][SQL][FOLLOWUP] Cannot compile COVAR_POP, COVAR_SAMP and CORR in `H2Dialect` if them with `DISTINCT`

### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/35145 compiles COVAR_POP, COVAR_SAMP and CORR in H2Dialect.
Because H2 doesn't support COVAR_POP, COVAR_SAMP and CORR with DISTINCT, that PR introduced a bug: these functions are still compiled when the aggregates are used with DISTINCT.
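As an illustration (a sketch mirroring the new test in the diff below, assuming the `h2` JDBC catalog configured by `JDBCV2Suite`): with `DISTINCT`, these aggregates are no longer compiled for H2, so Spark evaluates them itself instead of pushing them down.

```scala
// Pushed down to H2 (no DISTINCT).
val pushed = spark.sql(
  "SELECT COVAR_POP(bonus, bonus) FROM h2.test.employee WHERE dept > 0 GROUP BY dept")

// Not pushed down after this fix: H2 cannot evaluate COVAR_POP with DISTINCT,
// so Spark computes the aggregate on its side.
val notPushed = spark.sql(
  "SELECT COVAR_POP(DISTINCT bonus, bonus) FROM h2.test.employee WHERE dept > 0 GROUP BY dept")
```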

### Why are the changes needed?
Fix the bug so that COVAR_POP, COVAR_SAMP and CORR are not compiled when these aggregate functions are used with DISTINCT.

### Does this PR introduce _any_ user-facing change?
'Yes'.
The bug will be fixed.

### How was this patch tested?
New test cases.

Closes #37090 from beliefer/SPARK-37527_followup2.

Authored-by: Jiaan Geng 
Signed-off-by: Wenchen Fan 
---
 .../org/apache/spark/sql/jdbc/H2Dialect.scala  | 15 --
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala| 34 +++---
 2 files changed, 30 insertions(+), 19 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala
index 124cb001b5c..5dfc64d7b6c 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala
@@ -62,18 +62,15 @@ private[sql] object H2Dialect extends JdbcDialect {
   assert(f.children().length == 1)
   val distinct = if (f.isDistinct) "DISTINCT " else ""
   Some(s"STDDEV_SAMP($distinct${f.children().head})")
-case f: GeneralAggregateFunc if f.name() == "COVAR_POP" =>
+case f: GeneralAggregateFunc if f.name() == "COVAR_POP" && !f.isDistinct =>
   assert(f.children().length == 2)
-  val distinct = if (f.isDistinct) "DISTINCT " else ""
-  Some(s"COVAR_POP($distinct${f.children().head}, ${f.children().last})")
-case f: GeneralAggregateFunc if f.name() == "COVAR_SAMP" =>
+  Some(s"COVAR_POP(${f.children().head}, ${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "COVAR_SAMP" && !f.isDistinct =>
   assert(f.children().length == 2)
-  val distinct = if (f.isDistinct) "DISTINCT " else ""
-  Some(s"COVAR_SAMP($distinct${f.children().head}, ${f.children().last})")
-case f: GeneralAggregateFunc if f.name() == "CORR" =>
+  Some(s"COVAR_SAMP(${f.children().head}, ${f.children().last})")
+case f: GeneralAggregateFunc if f.name() == "CORR" && !f.isDistinct =>
   assert(f.children().length == 2)
-  val distinct = if (f.isDistinct) "DISTINCT " else ""
-  Some(s"CORR($distinct${f.children().head}, ${f.children().last})")
+  Some(s"CORR(${f.children().head}, ${f.children().last})")
 case _ => None
   }
 )
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala b/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala
index 108348fbcd3..0a713bdb76c 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala
@@ -1652,23 +1652,37 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel
   }
 
   test("scan with aggregate push-down: COVAR_POP COVAR_SAMP with filter and group by") {
-val df = sql("SELECT COVAR_POP(bonus, bonus), COVAR_SAMP(bonus, bonus)" +
+val df1 = sql("SELECT COVAR_POP(bonus, bonus), COVAR_SAMP(bonus, bonus)" +
   " FROM h2.test.employee WHERE dept > 0 GROUP BY DePt")
-checkFiltersRemoved(df)
-checkAggregateRemoved(df)
-checkPushedInfo(df, "PushedAggregates: [COVAR_POP(BONUS, BONUS), COVAR_SAMP(BONUS, BONUS)], " +
+checkFiltersRemoved(df1)
+checkAggregateRemoved(df1)
+checkPushedInfo(df1, "PushedAggregates: [COVAR_POP(BONUS, BONUS), COVAR_SAMP(BONUS, BONUS)], " +
   "PushedFilters: [DEPT IS NOT NULL, DEPT > 0], PushedGroupByExpressions: [DEPT]")
-checkAnswer(df, Seq(Row(1d, 2d), Row(2500d, 5000d), Row(0d, null)))
+checkAnswer(df1, Seq(Row(1d, 2d), Row(2500d, 5000d), Row(0d, null)))
+
+val df2 = sql("SELECT COVAR_POP(DISTINCT bonus, bonus), COVAR_SAMP(DISTINCT bonus, bonus)" +
+  " FROM h2.test.employee

[spark] branch master updated: [SPARK-39697][INFRA] Add REFRESH_DATE flag and use previous cache to build cache image

2022-07-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2a2608ad557 [SPARK-39697][INFRA] Add REFRESH_DATE flag and use previous cache to build cache image
2a2608ad557 is described below

commit 2a2608ad557e3ebb160287b7d7fd9d14c251b3c2
Author: Yikun Jiang 
AuthorDate: Thu Jul 7 08:59:38 2022 +0900

[SPARK-39697][INFRA] Add REFRESH_DATE flag and use previous cache to build cache image

### What changes were proposed in this pull request?
This patch has two improvements:
- Add `cache-from`: this helps speed up the cache build and ensures the image will NOT be fully refreshed if `REFRESH_DATE` is not changed intentionally.
- Add `FULL_REFRESH_DATE` in the dockerfile: this makes it possible to force a full refresh.

### Why are the changes needed?
Without this PR, the cache image is **completely refreshed** whenever the dockerfile changes. This causes different behavior between the CI tmp image (cache-based refresh, in the pyspark/sparkr/lint jobs) and the infra cache (full refresh, in the build infra cache job). As a result, if a PR changes the dockerfile, the pyspark/sparkr/lint CI may succeed on the PR but fail on the next run after the cache is refreshed (because dependencies may change when the image is fully refreshed).

After this PR, if you change the dockerfile, the cache image job does a cache-based refresh (reusing the previous cache as much as possible and rebuilding only the layers whose cache misses), which keeps the behavior consistent with the pyspark/sparkr/lint job results.

This behavior is similar to a **static image** to some extent: you can bump `FULL_REFRESH_DATE` to force a complete cache refresh, with the advantage that you can see the pyspark/sparkr/lint CI results in GA when you do a full refresh.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested locally.

Closes #37103 from Yikun/SPARK-39522-FOLLOWUP.

Authored-by: Yikun Jiang 
Signed-off-by: Hyukjin Kwon 
---
 .github/workflows/build_infra_images_cache.yml | 1 +
 dev/infra/Dockerfile   | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/.github/workflows/build_infra_images_cache.yml b/.github/workflows/build_infra_images_cache.yml
index 4ab27da7bdf..145769d1506 100644
--- a/.github/workflows/build_infra_images_cache.yml
+++ b/.github/workflows/build_infra_images_cache.yml
@@ -57,6 +57,7 @@ jobs:
   context: ./dev/infra/
   push: true
  tags: ghcr.io/apache/spark/apache-spark-github-action-image-cache:${{ github.ref_name }}
+  cache-from: type=registry,ref=ghcr.io/apache/spark/apache-spark-github-action-image-cache:${{ github.ref_name }}
  cache-to: type=registry,ref=ghcr.io/apache/spark/apache-spark-github-action-image-cache:${{ github.ref_name }},mode=max
   -
 name: Image digest
diff --git a/dev/infra/Dockerfile b/dev/infra/Dockerfile
index 8968b097251..e3ba4f6110b 100644
--- a/dev/infra/Dockerfile
+++ b/dev/infra/Dockerfile
@@ -18,6 +18,8 @@
 # Image for building and testing Spark branches. Based on Ubuntu 20.04.
 FROM ubuntu:20.04
 
+ENV FULL_REFRESH_DATE 20220706
+
 ENV DEBIAN_FRONTEND noninteractive
 ENV DEBCONF_NONINTERACTIVE_SEEN true
 





[spark] branch master updated: [SPARK-39648][PYTHON][PS][DOC] Fix type hints of `like`, `rlike`, `ilike` of Column

2022-07-06 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new be7dab12677 [SPARK-39648][PYTHON][PS][DOC] Fix type hints of `like`, `rlike`, `ilike` of Column
be7dab12677 is described below

commit be7dab12677a180908b6ce37847abdda12adeb9b
Author: Xinrong Meng 
AuthorDate: Thu Jul 7 08:53:50 2022 +0900

[SPARK-39648][PYTHON][PS][DOC] Fix type hints of `like`, `rlike`, `ilike` of Column

### What changes were proposed in this pull request?
Fix type hints of `like`, `rlike`, `ilike` of Column.

### Why are the changes needed?
Current type hints are incorrect, so the doc is confusing: `Union["Column", "LiteralType", "DecimalLiteral", "DateTimeLiteral"]` is hinted whereas only `str` is accepted.

The PR addresses the above issue by introducing `_bin_op_other_str`.
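For comparison, the Scala `Column` API makes the same point: these methods take a plain string pattern (a sketch, assuming a DataFrame `df` with a string column `name`; not part of this PR):

```scala
import org.apache.spark.sql.functions.col

df.filter(col("name").like("Al%"))    // SQL LIKE pattern
df.filter(col("name").rlike("ice$"))  // regular expression
```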

### Does this PR introduce _any_ user-facing change?
No. Doc change only.

### How was this patch tested?
Manual tests.

Closes #37038 from xinrong-databricks/like_rlike.

Authored-by: Xinrong Meng 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/pandas/series.py |   2 +-
 python/pyspark/sql/column.py| 117 +---
 2 files changed, 64 insertions(+), 55 deletions(-)

diff --git a/python/pyspark/pandas/series.py b/python/pyspark/pandas/series.py
index a7852c110f7..838077ed7cd 100644
--- a/python/pyspark/pandas/series.py
+++ b/python/pyspark/pandas/series.py
@@ -5024,7 +5024,7 @@ class Series(Frame, IndexOpsMixin, Generic[T]):
 else:
 if regex:
 # to_replace must be a string
-cond = self.spark.column.rlike(to_replace)
+cond = self.spark.column.rlike(cast(str, to_replace))
 else:
 cond = self.spark.column.isin(to_replace)
 # to_replace may be a scalar
diff --git a/python/pyspark/sql/column.py b/python/pyspark/sql/column.py
index 04458d560ee..31954a95690 100644
--- a/python/pyspark/sql/column.py
+++ b/python/pyspark/sql/column.py
@@ -573,57 +573,6 @@ class Column:
 >>> df.filter(df.name.contains('o')).collect()
 [Row(age=5, name='Bob')]
 """
-_rlike_doc = """
-SQL RLIKE expression (LIKE with Regex). Returns a boolean :class:`Column` based on a regex
-match.
-
-Parameters
---
-other : str
-an extended regex expression
-
-Examples
-
->>> df.filter(df.name.rlike('ice$')).collect()
-[Row(age=2, name='Alice')]
-"""
-_like_doc = """
-SQL like expression. Returns a boolean :class:`Column` based on a SQL LIKE match.
-
-Parameters
---
-other : str
-a SQL LIKE pattern
-
-See Also
-
-pyspark.sql.Column.rlike
-
-Examples
-
->>> df.filter(df.name.like('Al%')).collect()
-[Row(age=2, name='Alice')]
-"""
-_ilike_doc = """
-SQL ILIKE expression (case insensitive LIKE). Returns a boolean :class:`Column`
-based on a case insensitive match.
-
-.. versionadded:: 3.3.0
-
-Parameters
---
-other : str
-a SQL LIKE pattern
-
-See Also
-
-pyspark.sql.Column.rlike
-
-Examples
-
->>> df.filter(df.name.ilike('%Ice')).collect()
-[Row(age=2, name='Alice')]
-"""
 _startswith_doc = """
 String starts with. Returns a boolean :class:`Column` based on a string match.
 
@@ -656,12 +605,72 @@ class Column:
 """
 
 contains = _bin_op("contains", _contains_doc)
-rlike = _bin_op("rlike", _rlike_doc)
-like = _bin_op("like", _like_doc)
-ilike = _bin_op("ilike", _ilike_doc)
 startswith = _bin_op("startsWith", _startswith_doc)
 endswith = _bin_op("endsWith", _endswith_doc)
 
+def like(self: "Column", other: str) -> "Column":
+"""
+SQL like expression. Returns a boolean :class:`Column` based on a SQL LIKE match.
+
+Parameters
+--
+other : str
+a SQL LIKE pattern
+
+See Also
+
+pyspark.sql.Column.rlike
+
+Examples
+
+>>> df.filter(df.name.like('Al%')).collect()
+[Row(age=2, name='Alice')]
+"""
+njc = getattr(self._jc, "like")(other)
+return Column(njc)
+
+def rlike(self: "Column", other: str) -> "Column":
+"""
+SQL RLIKE expression (LIKE with Regex). Returns a boolean :class:`Column` based on a regex
+match.
+
+Parameters
+--
+other : str
+an extended regex expression
+
+Examples
+
+>>> df.filter(df.name.rlike('ice$')).collect()
+[Row(age=2, name='Alice')]
+"""
+njc = getattr(self._jc, "rlike")(othe

[spark] branch master updated: [SPARK-39701][CORE][K8S][TESTS] Move `withSecretFile` to `SparkFunSuite` to reuse

2022-07-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1cf4fe5cd4d [SPARK-39701][CORE][K8S][TESTS] Move `withSecretFile` to `SparkFunSuite` to reuse
1cf4fe5cd4d is described below

commit 1cf4fe5cd4dedd6ccd38fc9c159069f7c5a72191
Author: Dongjoon Hyun 
AuthorDate: Wed Jul 6 16:27:58 2022 -0700

[SPARK-39701][CORE][K8S][TESTS] Move `withSecretFile` to `SparkFunSuite` to reuse

### What changes were proposed in this pull request?

This PR aims to move `withSecretFile` to `SparkFunSuite` and reuse it in Kubernetes tests.
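A minimal usage sketch of the shared helper (hypothetical contents; assumes a suite extending `SparkFunSuite`):

```scala
import java.nio.file.Files

withSecretFile("my-secret") { secretFile =>
  // secretFile holds the given contents; its temp directory is removed
  // automatically once this block returns.
  assert(Files.readAllLines(secretFile.toPath).get(0) == "my-secret")
}
```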

### Why are the changes needed?

Currently, K8s unit tests leave a directory behind because they don't clean up the temporary secret files. By reusing the existing method, we can avoid this:
```
$ build/sbt -Pkubernetes "kubernetes/test"
$ git status
On branch master
Your branch is up to date with 'apache/master'.

Untracked files:
  (use "git add ..." to include in what will be committed)
resource-managers/kubernetes/core/temp-secret/
```

### Does this PR introduce _any_ user-facing change?

No. This is a test-only change.

### How was this patch tested?

Pass the CIs.

Closes #37106 from dongjoon-hyun/SPARK-39701.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../org/apache/spark/SecurityManagerSuite.scala| 11 +---
 .../scala/org/apache/spark/SparkFunSuite.scala | 16 ++-
 .../features/BasicExecutorFeatureStepSuite.scala   | 31 +-
 3 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/core/src/test/scala/org/apache/spark/SecurityManagerSuite.scala b/core/src/test/scala/org/apache/spark/SecurityManagerSuite.scala
index 44e338c6f00..a11ecc22d0b 100644
--- a/core/src/test/scala/org/apache/spark/SecurityManagerSuite.scala
+++ b/core/src/test/scala/org/apache/spark/SecurityManagerSuite.scala
@@ -29,7 +29,7 @@ import org.apache.spark.internal.config._
 import org.apache.spark.internal.config.UI._
 import org.apache.spark.launcher.SparkLauncher
 import org.apache.spark.security.GroupMappingServiceProvider
-import org.apache.spark.util.{ResetSystemProperties, SparkConfWithEnv, Utils}
+import org.apache.spark.util.{ResetSystemProperties, SparkConfWithEnv}
 
 class DummyGroupMappingServiceProvider extends GroupMappingServiceProvider {
 
@@ -513,14 +513,5 @@ class SecurityManagerSuite extends SparkFunSuite with ResetSystemProperties {
   private def encodeFileAsBase64(secretFile: File) = {
 Base64.getEncoder.encodeToString(Files.readAllBytes(secretFile.toPath))
   }
-
-  private def withSecretFile(contents: String = "test-secret")(f: File => Unit): Unit = {
-val secretDir = Utils.createTempDir("temp-secrets")
-val secretFile = new File(secretDir, "temp-secret.txt")
-Files.write(secretFile.toPath, contents.getBytes(UTF_8))
-try f(secretFile) finally {
-  Utils.deleteRecursively(secretDir)
-}
-  }
 }
 
diff --git a/core/src/test/scala/org/apache/spark/SparkFunSuite.scala b/core/src/test/scala/org/apache/spark/SparkFunSuite.scala
index 7922e13db69..b17aacc0a9f 100644
--- a/core/src/test/scala/org/apache/spark/SparkFunSuite.scala
+++ b/core/src/test/scala/org/apache/spark/SparkFunSuite.scala
@@ -18,7 +18,8 @@
 package org.apache.spark
 
 import java.io.File
-import java.nio.file.Path
+import java.nio.charset.StandardCharsets.UTF_8
+import java.nio.file.{Files, Path}
 import java.util.{Locale, TimeZone}
 
 import scala.annotation.tailrec
@@ -223,6 +224,19 @@ abstract class SparkFunSuite
 }
   }
 
+  /**
+   * Creates a temporary directory containing a secret file, which is then passed to `f` and
+   * will be deleted after `f` returns.
+   */
+  protected def withSecretFile(contents: String = "test-secret")(f: File => Unit): Unit = {
+val secretDir = Utils.createTempDir("temp-secrets")
+val secretFile = new File(secretDir, "temp-secret.txt")
+Files.write(secretFile.toPath, contents.getBytes(UTF_8))
+try f(secretFile) finally {
+  Utils.deleteRecursively(secretDir)
+}
+  }
+
   /**
   * Adds a log appender and optionally sets a log level to the root logger or the logger with
   * the specified name, then executes the specified function, and in the end removes the log
diff --git a/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStepSuite.scala b/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStepSuite.scala
index 84c4f3b8ba3..420edddb693 100644
--- a/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStepSuite.scala
+++ b/resource-managers/kubernetes/core/src/test/scala/org/apa

[GitHub] [spark-website] holdenk commented on pull request #400: [SPARK-39512] Document docker image release steps

2022-07-06 Thread GitBox


holdenk commented on PR #400:
URL: https://github.com/apache/spark-website/pull/400#issuecomment-1176618299

   ping @MaxGekk & @tgravescs @gengliangwang since y'all had comments on the first draft, this one looking ok?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org





[spark] branch master updated: [SPARK-39663][SQL][TESTS] Add UT for MysqlDialect listIndexes method

2022-07-06 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9983bdb3b88 [SPARK-39663][SQL][TESTS] Add UT for MysqlDialect listIndexes method
9983bdb3b88 is described below

commit 9983bdb3b882a083cba9785392c3ba5d7a36496a
Author: panbingkun 
AuthorDate: Wed Jul 6 11:27:17 2022 -0500

[SPARK-39663][SQL][TESTS] Add UT for MysqlDialect listIndexes method

### What changes were proposed in this pull request?
Add a complementary UT for MysqlDialect's listIndexes method.

### Why are the changes needed?
Add a UT for an existing function and improve test coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes #37060 from panbingkun/SPARK-39663.

Authored-by: panbingkun 
Signed-off-by: Sean Owen 
---
 .../spark/sql/jdbc/v2/MySQLIntegrationSuite.scala  |  2 ++
 .../org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala  | 30 ++
 .../org/apache/spark/sql/jdbc/MySQLDialect.scala   |  2 +-
 3 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MySQLIntegrationSuite.scala b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MySQLIntegrationSuite.scala
index 97f521a378e..6e76b74c7d8 100644
--- a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MySQLIntegrationSuite.scala
+++ b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MySQLIntegrationSuite.scala
@@ -119,6 +119,8 @@ class MySQLIntegrationSuite extends DockerJDBCIntegrationV2Suite with V2JDBCTest
 
   override def supportsIndex: Boolean = true
 
+  override def supportListIndexes: Boolean = true
+
   override def indexOptions: String = "KEY_BLOCK_SIZE=10"
 
   testVarPop()
diff --git a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala
index 5f0033490d5..0f85bd534c3 100644
--- a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala
+++ b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala
@@ -197,6 +197,8 @@ private[v2] trait V2JDBCTest extends SharedSparkSession with DockerIntegrationFu
 
   def supportsIndex: Boolean = false
 
+  def supportListIndexes: Boolean = false
+
   def indexOptions: String = ""
 
   test("SPARK-36895: Test INDEX Using SQL") {
@@ -219,11 +221,21 @@ private[v2] trait V2JDBCTest extends SharedSparkSession with DockerIntegrationFu
   s" The supported Index Types are:"))
 
 sql(s"CREATE index i1 ON $catalogName.new_table USING BTREE (col1)")
+assert(jdbcTable.indexExists("i1"))
+if (supportListIndexes) {
+  val indexes = jdbcTable.listIndexes()
+  assert(indexes.size == 1)
+  assert(indexes.head.indexName() == "i1")
+}
+
 sql(s"CREATE index i2 ON $catalogName.new_table (col2, col3, col5)" +
   s" OPTIONS ($indexOptions)")
-
-assert(jdbcTable.indexExists("i1") == true)
-assert(jdbcTable.indexExists("i2") == true)
+assert(jdbcTable.indexExists("i2"))
+if (supportListIndexes) {
+  val indexes = jdbcTable.listIndexes()
+  assert(indexes.size == 2)
+  assert(indexes.map(_.indexName()).sorted === Array("i1", "i2"))
+}
 
 // This should pass without exception
 sql(s"CREATE index IF NOT EXISTS i1 ON $catalogName.new_table (col1)")
@@ -234,10 +246,18 @@ private[v2] trait V2JDBCTest extends SharedSparkSession with DockerIntegrationFu
 assert(m.contains("Failed to create index i1 in new_table"))
 
 sql(s"DROP index i1 ON $catalogName.new_table")
-sql(s"DROP index i2 ON $catalogName.new_table")
-
 assert(jdbcTable.indexExists("i1") == false)
+if (supportListIndexes) {
+  val indexes = jdbcTable.listIndexes()
+  assert(indexes.size == 1)
+  assert(indexes.head.indexName() == "i2")
+}
+
+sql(s"DROP index i2 ON $catalogName.new_table")
 assert(jdbcTable.indexExists("i2") == false)
+if (supportListIndexes) {
+  assert(jdbcTable.listIndexes().isEmpty)
+}
 
 // This should pass without exception
 sql(s"DROP index IF EXISTS i1 ON $catalogName.new_table")
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala
index 24f9bac74f8..c4cb5369af9 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala
+++ b/sql/core/src/main/scala/org

[spark] branch master updated: [SPARK-39691][TESTS] Update `MapStatusesConvertBenchmark` result files

2022-07-06 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3cf07c3e031 [SPARK-39691][TESTS] Update `MapStatusesConvertBenchmark` result files
3cf07c3e031 is described below

commit 3cf07c3e03195b95a37d3635f736fe78b70d22f7
Author: yangjie01 
AuthorDate: Wed Jul 6 09:23:37 2022 -0700

[SPARK-39691][TESTS] Update `MapStatusesConvertBenchmark` result files

### What changes were proposed in this pull request?
SPARK-39325 added `MapStatusesConvertBenchmark` but only uploaded the result file generated by Java 8, so this PR supplements the `MapStatusesConvertBenchmark` results generated by Java 11 and 17.

On the other hand, SPARK-39626 upgraded `RoaringBitmap` from `0.9.28` to `0.9.30`, and `IntelliJ Profiler` sampling shows that the hotspot path of `MapStatusesConvertBenchmark` contains `RoaringBitmap#contains`, so this PR also updates the result file generated by Java 8.

### Why are the changes needed?
Update `MapStatusesConvertBenchmark` result files

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #37100 from LuciferYang/MapStatusesConvertBenchmark-result.

Authored-by: yangjie01 
Signed-off-by: Dongjoon Hyun 
---
 .../MapStatusesConvertBenchmark-jdk11-results.txt   | 13 +
 ...ts.txt => MapStatusesConvertBenchmark-jdk17-results.txt} |  8 
 core/benchmarks/MapStatusesConvertBenchmark-results.txt | 10 +-
 3 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/core/benchmarks/MapStatusesConvertBenchmark-jdk11-results.txt b/core/benchmarks/MapStatusesConvertBenchmark-jdk11-results.txt
new file mode 100644
index 000..96fa24175c5
--- /dev/null
+++ b/core/benchmarks/MapStatusesConvertBenchmark-jdk11-results.txt
@@ -0,0 +1,13 @@
+================================================================================================
+MapStatuses Convert Benchmark
+================================================================================================
+
+OpenJDK 64-Bit Server VM 11.0.15+10-LTS on Linux 5.13.0-1031-azure
+Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
+MapStatuses Convert:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+Num Maps: 5 Fetch partitions:500                   1324           1333           7         0.0  1324283680.0       1.0X
+Num Maps: 5 Fetch partitions:1000                  2650           2670          32         0.0  2650318387.0       0.5X
+Num Maps: 5 Fetch partitions:1500                  4018           4059          53         0.0  4017921009.0       0.3X
+
+
diff --git a/core/benchmarks/MapStatusesConvertBenchmark-results.txt b/core/benchmarks/MapStatusesConvertBenchmark-jdk17-results.txt
similarity index 54%
copy from core/benchmarks/MapStatusesConvertBenchmark-results.txt
copy to core/benchmarks/MapStatusesConvertBenchmark-jdk17-results.txt
index f41401bbe2e..0ba8d756dfc 100644
--- a/core/benchmarks/MapStatusesConvertBenchmark-results.txt
+++ b/core/benchmarks/MapStatusesConvertBenchmark-jdk17-results.txt
@@ -2,12 +2,12 @@
 MapStatuses Convert Benchmark
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1025-azure
+OpenJDK 64-Bit Server VM 17.0.3+7-LTS on Linux 5.13.0-1031-azure
 Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
 MapStatuses Convert:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Num Maps: 5 Fetch partitions:500                   1330           1359          26         0.0  1329827185.0       1.0X
-Num Maps: 5 Fetch partitions:1000                  2648           2666          20         0.0  2647944453.0       0.5X
-Num Maps: 5 Fetch partitions:1500                  4155           4436         383         0.0  4154563448.0       0.3X
+Num Maps: 5 Fetch partitions:500                   1092           1104          22         0.0  1091691925.0       1.0X
+Num Maps: 5 Fetch partitions:1000                  2172           2192          29         0.0  2171702137.0       0.5X
+Num Maps: 5 Fetch partitions:1500                  3268           3291          27         0.0  3267904436.0       0.3X
 
 
diff --git a/core/benchmarks/MapStatusesConvertBenchmark-results.txt b/core/benchmarks/MapStatusesConvertBenchmark-results.txt
index f41401bbe2e..ae84abfdcc2 100644
--- a/core/benchmarks/MapStatusesConvertBenchmark-results