[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.

2022-10-31 Thread GitBox


SandishKumarHN commented on code in PR #38344:
URL: https://github.com/apache/spark/pull/38344#discussion_r1010092402


##
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufUtils.scala:
##
@@ -177,47 +175,75 @@ private[sql] object ProtobufUtils extends Logging {
   .asInstanceOf[Descriptor]
   }
 
+  // TODO: Revisit to ensure that messageName is searched through all imports
   def buildDescriptor(descFilePath: String, messageName: String): Descriptor = 
{
-val descriptor = 
parseFileDescriptor(descFilePath).getMessageTypes.asScala.find { desc =>
-  desc.getName == messageName || desc.getFullName == messageName
+val descriptorList = parseFileDescriptor(descFilePath).map(fileDescriptor 
=> {

Review Comment:
   @rangadi makes sense to use find and return, fixed.  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1010090254


##
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##
@@ -217,3 +218,23 @@ message Sample {
 int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;
+  // Optional. Default value = 1
+  Step step = 3;

Review Comment:
   Yes let me follow up. I guess I was looking at python side API somehow thus 
confused myself on the types.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on pull request #38423: [SPARK-40949][CONNECT][PYTHON] Implement `DataFrame.sortWithinPartitions`

2022-10-31 Thread GitBox


amaliujia commented on PR #38423:
URL: https://github.com/apache/spark/pull/38423#issuecomment-1298056172

   Late LGTM thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #38460: [SPARK-40981][CONNECT][PYTHON] Support session.range in Python client

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38460:
URL: https://github.com/apache/spark/pull/38460#discussion_r1010087722


##
python/pyspark/sql/tests/connect/test_connect_plan_only.py:
##
@@ -118,6 +118,15 @@ def test_relation_alias(self):
 plan = df.alias("table_alias")._plan.to_proto(self.connect)
 self.assertEqual(plan.root.common.alias, "table_alias")
 
+def test_range(self):
+plan = self.connect.range(start=10, end=20, step=3, 
num_partitions=4)._plan.to_proto(
+self.connect
+)
+self.assertEqual(plan.root.range.start, 10)
+self.assertEqual(plan.root.range.end, 20)
+self.assertEqual(plan.root.range.step.step, 3)
+self.assertEqual(plan.root.range.num_partitions.num_partitions, 4)
+

Review Comment:
   I added this test case but only test `step` and `num_partitions` is not set.
   
   Right now there is a division between client and server such that:
   1. Client take care of required fields, meaning that clients need to make 
sure the required fields are set.
   2. Server side take care of default values for optional fields. 
   
   This is to reduce load for both sides of implementation:
   1. clients do not need to worry about default values for optional fields 
unless the default value is exposed on the DataFrame API already.
   2. Server side do not care for whether required field is set (clients 
enforce it) but server side tracks the default value for optional fields. This 
can also avoid that clients side to set different default value. The default 
values are documented in proto: 
https://github.com/apache/spark/blob/fb64041cea3e094b0807cb580deacc721b302408/connector/connect/src/main/protobuf/spark/connect/relations.proto#L232
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

2022-10-31 Thread GitBox


zhengruifeng commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1010084761


##
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##
@@ -217,3 +218,23 @@ message Sample {
 int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;
+  // Optional. Default value = 1
+  Step step = 3;

Review Comment:
   `start`, `end`, `step` should use `int64` @amaliujia 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #38460: [SPARK-40981][CONNECT][PYTHON] Support session.range in Python client

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38460:
URL: https://github.com/apache/spark/pull/38460#discussion_r1010084257


##
python/pyspark/sql/connect/client.py:
##
@@ -145,6 +145,39 @@ def _build_metrics(self, metrics: "pb2.Response.Metrics") 
-> typing.List[PlanMet
 def sql(self, sql_string: str) -> "DataFrame":
 return DataFrame.withPlan(SQL(sql_string), self)
 
+def range(
+self,
+start: int,
+end: int,
+step: Optional[int] = None,

Review Comment:
   hmmm we are not marking `step` as required because Scala side implementation 
does not treat it as required and thus it has also a default value.
   
   ```
 private def transformRange(rel: proto.Range): LogicalPlan = {
   val start = rel.getStart
   val end = rel.getEnd
   val step = if (rel.hasStep) {
 rel.getStep.getStep
   } else {
 1
   }
   val numPartitions = if (rel.hasNumPartitions) {
 rel.getNumPartitions.getNumPartitions
   } else {
 session.leafNodeDefaultParallelism
   }
   logical.Range(start, end, step, numPartitions)
 }
   ```
   
   Same for `numPartitions`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #38460: [SPARK-40981][CONNECT][PYTHON] Support session.range in Python client

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38460:
URL: https://github.com/apache/spark/pull/38460#discussion_r1010084257


##
python/pyspark/sql/connect/client.py:
##
@@ -145,6 +145,39 @@ def _build_metrics(self, metrics: "pb2.Response.Metrics") 
-> typing.List[PlanMet
 def sql(self, sql_string: str) -> "DataFrame":
 return DataFrame.withPlan(SQL(sql_string), self)
 
+def range(
+self,
+start: int,
+end: int,
+step: Optional[int] = None,

Review Comment:
   hmmm we are not marking `step` as required because Scala side implementation 
does not treat it as required and it has a default value.
   
   ```
 private def transformRange(rel: proto.Range): LogicalPlan = {
   val start = rel.getStart
   val end = rel.getEnd
   val step = if (rel.hasStep) {
 rel.getStep.getStep
   } else {
 1
   }
   val numPartitions = if (rel.hasNumPartitions) {
 rel.getNumPartitions.getNumPartitions
   } else {
 session.leafNodeDefaultParallelism
   }
   logical.Range(start, end, step, numPartitions)
 }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #38460: [SPARK-40981][CONNECT][PYTHON] Support session.range in Python client

2022-10-31 Thread GitBox


zhengruifeng commented on code in PR #38460:
URL: https://github.com/apache/spark/pull/38460#discussion_r1010078740


##
python/pyspark/sql/connect/client.py:
##
@@ -145,6 +145,39 @@ def _build_metrics(self, metrics: "pb2.Response.Metrics") 
-> typing.List[PlanMet
 def sql(self, sql_string: str) -> "DataFrame":
 return DataFrame.withPlan(SQL(sql_string), self)
 
+def range(
+self,
+start: int,
+end: int,
+step: Optional[int] = None,

Review Comment:
   i think we can use `step: int = 1`



##
python/pyspark/sql/tests/connect/test_connect_plan_only.py:
##
@@ -118,6 +118,15 @@ def test_relation_alias(self):
 plan = df.alias("table_alias")._plan.to_proto(self.connect)
 self.assertEqual(plan.root.common.alias, "table_alias")
 
+def test_range(self):
+plan = self.connect.range(start=10, end=20, step=3, 
num_partitions=4)._plan.to_proto(
+self.connect
+)
+self.assertEqual(plan.root.range.start, 10)
+self.assertEqual(plan.root.range.end, 20)
+self.assertEqual(plan.root.range.step.step, 3)
+self.assertEqual(plan.root.range.num_partitions.num_partitions, 4)
+

Review Comment:
   what about adding a new case like `range(start=10, end=20)` and check:
   1, step is set 1 (if we make the default value 1);
   2, num_partitions not set;



##
python/pyspark/sql/connect/client.py:
##
@@ -145,6 +145,39 @@ def _build_metrics(self, metrics: "pb2.Response.Metrics") 
-> typing.List[PlanMet
 def sql(self, sql_string: str) -> "DataFrame":
 return DataFrame.withPlan(SQL(sql_string), self)
 
+def range(
+self,
+start: int,
+end: int,
+step: Optional[int] = None,

Review Comment:
   furthermore, i think we can make `step` a required field in proto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog

2022-10-31 Thread GitBox


LuciferYang commented on code in PR #38430:
URL: https://github.com/apache/spark/pull/38430#discussion_r1010061538


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala:
##
@@ -277,10 +295,34 @@ class HDFSMetadataLog[T <: AnyRef : 
ClassTag](sparkSession: SparkSession, path:
 for (batchId <- batchIds if batchId > thresholdBatchId) {
   val path = batchIdToPath(batchId)
   fileManager.delete(path)
+  if (metadataCacheEnabled) batchCache.remove(batchId)
   logTrace(s"Removed metadata log file: $path")
 }
   }
 
+
+  /**
+   * List the available batches on file system. As a workaround for S3 
inconsistent list, it also
+   * tries to take `batchCache` into consideration to infer a better answer.
+   */
+  protected def listBatches: Array[Long] = {
+val batchIds = fileManager.list(metadataPath, batchFilesFilter)
+  .map(f => pathToBatchId(f.getPath)) ++
+  // Iterate over keySet is not thread safe. We call `toArray` to make a 
copy in the lock to
+  // elimiate the race condition.
+  batchCache.synchronized {
+batchCache.keySet.asScala.toArray

Review Comment:
   ~How about `batchCache.keySet().toArray`? ~
   ~And `SynchronizedCollection.toArray` seem threadsafe~



##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala:
##
@@ -277,10 +295,34 @@ class HDFSMetadataLog[T <: AnyRef : 
ClassTag](sparkSession: SparkSession, path:
 for (batchId <- batchIds if batchId > thresholdBatchId) {
   val path = batchIdToPath(batchId)
   fileManager.delete(path)
+  if (metadataCacheEnabled) batchCache.remove(batchId)
   logTrace(s"Removed metadata log file: $path")
 }
   }
 
+
+  /**
+   * List the available batches on file system. As a workaround for S3 
inconsistent list, it also
+   * tries to take `batchCache` into consideration to infer a better answer.
+   */
+  protected def listBatches: Array[Long] = {
+val batchIds = fileManager.list(metadataPath, batchFilesFilter)
+  .map(f => pathToBatchId(f.getPath)) ++
+  // Iterate over keySet is not thread safe. We call `toArray` to make a 
copy in the lock to
+  // elimiate the race condition.
+  batchCache.synchronized {
+batchCache.keySet.asScala.toArray

Review Comment:
   ~How about `batchCache.keySet().toArray`?~
   ~And `SynchronizedCollection.toArray` seem threadsafe~



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog

2022-10-31 Thread GitBox


LuciferYang commented on code in PR #38430:
URL: https://github.com/apache/spark/pull/38430#discussion_r1010061538


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala:
##
@@ -277,10 +295,34 @@ class HDFSMetadataLog[T <: AnyRef : 
ClassTag](sparkSession: SparkSession, path:
 for (batchId <- batchIds if batchId > thresholdBatchId) {
   val path = batchIdToPath(batchId)
   fileManager.delete(path)
+  if (metadataCacheEnabled) batchCache.remove(batchId)
   logTrace(s"Removed metadata log file: $path")
 }
   }
 
+
+  /**
+   * List the available batches on file system. As a workaround for S3 
inconsistent list, it also
+   * tries to take `batchCache` into consideration to infer a better answer.
+   */
+  protected def listBatches: Array[Long] = {
+val batchIds = fileManager.list(metadataPath, batchFilesFilter)
+  .map(f => pathToBatchId(f.getPath)) ++
+  // Iterate over keySet is not thread safe. We call `toArray` to make a 
copy in the lock to
+  // elimiate the race condition.
+  batchCache.synchronized {
+batchCache.keySet.asScala.toArray

Review Comment:
   How about `batchCache.keySet().toArray`? 
   ~And `SynchronizedCollection.toArray` seem threadsafe~



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`

2022-10-31 Thread GitBox


zhengruifeng commented on code in PR #38395:
URL: https://github.com/apache/spark/pull/38395#discussion_r1010073193


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##
@@ -2101,3 +2101,53 @@ object AsOfJoin {
 }
   }
 }
+
+
+/**
+ * A logical plan for summary.
+ */
+case class UnresolvedSummary(
+child: LogicalPlan,
+statistics: Seq[String]) extends UnaryNode {
+
+  private lazy val supported =
+Set("count", "count_distinct", "approx_count_distinct", "mean", "stddev", 
"min", "max")
+
+  {
+// TODO: throw AnalysisException instead
+require(statistics.nonEmpty)
+val percentiles = statistics.filter(p => p.endsWith("%")).map { p =>
+  try {
+p.stripSuffix("%").toDouble / 100.0
+  } catch {
+case e: NumberFormatException =>
+  throw QueryExecutionErrors.cannotParseStatisticAsPercentileError(p, 
e)
+  }
+}
+require(percentiles.forall(p => p >= 0 && p <= 1), "Percentiles must be in 
the range [0, 1]")
+
+statistics.foreach {
+  case s if supported.contains(s) =>
+  case p if p.endsWith("%") =>
+  case s => throw QueryExecutionErrors.statisticNotRecognizedError(s)
+}
+  }
+
+  override protected def stringArgs: Iterator[Any] = super.stringArgs.take(5)
+
+  override lazy val resolved = false // Summary will be replaced after being 
resolved.
+
+  final override val nodePatterns: Seq[TreePattern] = Seq(SUMMARY)
+
+  override def output: Seq[Attribute] = {
+AttributeReference("summary", StringType)() +:
+  child.output.flatMap { attr =>
+if (attr.dataType.isInstanceOf[NumericType] || 
attr.dataType.isInstanceOf[StringType]) {
+  Some(AttributeReference(attr.name, StringType)())
+} else None

Review Comment:
   this is a `flatMap`, and `None` will be ignored
   currently datatypes other than `NumericType ` and `StringType ` will be 
discarded



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`

2022-10-31 Thread GitBox


zhengruifeng commented on code in PR #38395:
URL: https://github.com/apache/spark/pull/38395#discussion_r1010072313


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##
@@ -2101,3 +2101,53 @@ object AsOfJoin {
 }
   }
 }
+
+
+/**
+ * A logical plan for summary.
+ */
+case class UnresolvedSummary(
+child: LogicalPlan,
+statistics: Seq[String]) extends UnaryNode {
+
+  private lazy val supported =
+Set("count", "count_distinct", "approx_count_distinct", "mean", "stddev", 
"min", "max")
+
+  {
+// TODO: throw AnalysisException instead
+require(statistics.nonEmpty)

Review Comment:
   since there are several UT checking the thrown exception, so I let it alone 
to keep the behavior.
   I think we can define new error classes in separate PRs



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on pull request #38460: [SPARK-40981][CONNECT][PYTHON] Support session.range in Python client

2022-10-31 Thread GitBox


amaliujia commented on PR #38460:
URL: https://github.com/apache/spark/pull/38460#issuecomment-1298026802

   R: @HyukjinKwon @zhengruifeng 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #38423: [SPARK-40949][CONNECT][PYTHON] Implement `DataFrame.sortWithinPartitions`

2022-10-31 Thread GitBox


zhengruifeng commented on PR #38423:
URL: https://github.com/apache/spark/pull/38423#issuecomment-1298024181

   merged into master, thanks @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng closed pull request #38423: [SPARK-40949][CONNECT][PYTHON] Implement `DataFrame.sortWithinPartitions`

2022-10-31 Thread GitBox


zhengruifeng closed pull request #38423: [SPARK-40949][CONNECT][PYTHON] 
Implement `DataFrame.sortWithinPartitions`
URL: https://github.com/apache/spark/pull/38423


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #38423: [SPARK-40949][CONNECT][PYTHON] Implement `DataFrame.sortWithinPartitions`

2022-10-31 Thread GitBox


zhengruifeng commented on PR #38423:
URL: https://github.com/apache/spark/pull/38423#issuecomment-1298023324

   the 3 force pushes only resolved conflicts in generated files, let me merge 
it now


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan closed pull request #38415: [SPARK-40938][CONNECT] Support Alias for every type of Relation

2022-10-31 Thread GitBox


cloud-fan closed pull request #38415: [SPARK-40938][CONNECT] Support Alias for 
every type of Relation
URL: https://github.com/apache/spark/pull/38415


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #38415: [SPARK-40938][CONNECT] Support Alias for every type of Relation

2022-10-31 Thread GitBox


cloud-fan commented on PR #38415:
URL: https://github.com/apache/spark/pull/38415#issuecomment-1298019712

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ben-zhang commented on a diff in pull request #38433: [SPARK-40943][SQL] Make the MSCK keyword optional in MSCK REPAIR TABLE commands

2022-10-31 Thread GitBox


ben-zhang commented on code in PR #38433:
URL: https://github.com/apache/spark/pull/38433#discussion_r1010063647


##
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##
@@ -1728,6 +1728,14 @@ class DataSourceV2SQLSuiteV1Filter extends 
DataSourceV2SQLSuite with AlterTableT
 }
   }
 
+  test("REPAIR TABLE omit MSCK") {

Review Comment:
   Sounds good, I can remove this one. I agree that e2e tests are not necessary 
for a parser change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog

2022-10-31 Thread GitBox


LuciferYang commented on code in PR #38430:
URL: https://github.com/apache/spark/pull/38430#discussion_r1010063341


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala:
##
@@ -19,7 +19,9 @@ package org.apache.spark.sql.execution.streaming
 
 import java.io._
 import java.nio.charset.StandardCharsets
+import java.util.{Collections, LinkedHashMap}

Review Comment:
   nit: can rename `LinkedHashMap` to `JLinkedHashMap` or others to make it 
clearer
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog

2022-10-31 Thread GitBox


LuciferYang commented on code in PR #38430:
URL: https://github.com/apache/spark/pull/38430#discussion_r1010062994


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala:
##
@@ -64,6 +67,17 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](sparkSession: 
SparkSession, path:
 fileManager.mkdirs(metadataPath)
   }
 
+  protected val metadataCacheEnabled: Boolean
+  = 
sparkSession.sessionState.conf.getConf(SQLConf.STREAMING_METADATA_CACHE_ENABLED)
+
+  /**
+   * Cache the latest two batches. [[StreamExecution]] usually just accesses 
the latest two batches
+   * when committing offsets, this cache will save some file system operations.
+   */
+  protected[sql] val batchCache = Collections.synchronizedMap(new 
LinkedHashMap[Long, T](2) {

Review Comment:
   Could we create `Map` instance only when `metadataCacheEnabled` is true? 
Maybe we can make it as `Option`
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog

2022-10-31 Thread GitBox


LuciferYang commented on code in PR #38430:
URL: https://github.com/apache/spark/pull/38430#discussion_r1010061925


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala:
##
@@ -277,10 +295,34 @@ class HDFSMetadataLog[T <: AnyRef : 
ClassTag](sparkSession: SparkSession, path:
 for (batchId <- batchIds if batchId > thresholdBatchId) {
   val path = batchIdToPath(batchId)
   fileManager.delete(path)
+  if (metadataCacheEnabled) batchCache.remove(batchId)
   logTrace(s"Removed metadata log file: $path")
 }
   }
 
+
+  /**
+   * List the available batches on file system. As a workaround for S3 
inconsistent list, it also
+   * tries to take `batchCache` into consideration to infer a better answer.
+   */
+  protected def listBatches: Array[Long] = {
+val batchIds = fileManager.list(metadataPath, batchFilesFilter)
+  .map(f => pathToBatchId(f.getPath)) ++
+  // Iterate over keySet is not thread safe. We call `toArray` to make a 
copy in the lock to
+  // elimiate the race condition.
+  batchCache.synchronized {
+batchCache.keySet.asScala.toArray
+  }
+logInfo("BatchIds found from listing: " + batchIds.sorted.mkString(", "))
+
+if (batchIds.isEmpty) {
+  return Array.empty

Review Comment:
   nit: redundant `return`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #38430: [SPARK-40957] Add in memory cache in HDFSMetadataLog

2022-10-31 Thread GitBox


LuciferYang commented on code in PR #38430:
URL: https://github.com/apache/spark/pull/38430#discussion_r1010061538


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala:
##
@@ -277,10 +295,34 @@ class HDFSMetadataLog[T <: AnyRef : 
ClassTag](sparkSession: SparkSession, path:
 for (batchId <- batchIds if batchId > thresholdBatchId) {
   val path = batchIdToPath(batchId)
   fileManager.delete(path)
+  if (metadataCacheEnabled) batchCache.remove(batchId)
   logTrace(s"Removed metadata log file: $path")
 }
   }
 
+
+  /**
+   * List the available batches on file system. As a workaround for S3 
inconsistent list, it also
+   * tries to take `batchCache` into consideration to infer a better answer.
+   */
+  protected def listBatches: Array[Long] = {
+val batchIds = fileManager.list(metadataPath, batchFilesFilter)
+  .map(f => pathToBatchId(f.getPath)) ++
+  // Iterate over keySet is not thread safe. We call `toArray` to make a 
copy in the lock to
+  // elimiate the race condition.
+  batchCache.synchronized {
+batchCache.keySet.asScala.toArray

Review Comment:
   How about `batchCache.keySet().toArray`? And 
`SynchronizedCollection.toArray` seem threadsafe



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] rangadi commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.

2022-10-31 Thread GitBox


rangadi commented on code in PR #38344:
URL: https://github.com/apache/spark/pull/38344#discussion_r1006324273


##
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufUtils.scala:
##
@@ -175,19 +175,26 @@ private[sql] object ProtobufUtils extends Logging {
   .asInstanceOf[Descriptor]
   }
 
+  // TODO: Revisit to ensure that messageName is searched through all imports

Review Comment:
   What is missing? Looks fairly complete to me.
   Better to state the problem here.



##
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufUtils.scala:
##
@@ -177,47 +175,75 @@ private[sql] object ProtobufUtils extends Logging {
   .asInstanceOf[Descriptor]
   }
 
+  // TODO: Revisit to ensure that messageName is searched through all imports
   def buildDescriptor(descFilePath: String, messageName: String): Descriptor = 
{
-val descriptor = 
parseFileDescriptor(descFilePath).getMessageTypes.asScala.find { desc =>
-  desc.getName == messageName || desc.getFullName == messageName
+val descriptorList = parseFileDescriptor(descFilePath).map(fileDescriptor 
=> {
+  fileDescriptor.getMessageTypes.asScala.find { desc =>
+desc.getName == messageName || desc.getFullName == messageName
+  }
+}).filter(f => !f.isEmpty)
+
+if (descriptorList.isEmpty) {
+  throw 
QueryCompilationErrors.noProtobufMessageTypeReturnError(messageName)
 }
 
-descriptor match {
+descriptorList.last match {
   case Some(d) => d
   case None =>
-throw new RuntimeException(s"Unable to locate Message '$messageName' 
in Descriptor")
+throw 
QueryCompilationErrors.unableToLocateProtobufMessageError(messageName)
 }
   }
 
-  private def parseFileDescriptor(descFilePath: String): 
Descriptors.FileDescriptor = {
+  private def parseFileDescriptor(descFilePath: String): 
List[Descriptors.FileDescriptor] = {

Review Comment:
   Rename to `parseFileDescriptorSet` (otherwise it sounds like it is parsing 
just one file descriptor). 



##
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufUtils.scala:
##
@@ -175,19 +175,26 @@ private[sql] object ProtobufUtils extends Logging {
   .asInstanceOf[Descriptor]
   }
 
+  // TODO: Revisit to ensure that messageName is searched through all imports
   def buildDescriptor(descFilePath: String, messageName: String): Descriptor = 
{
-val descriptor = 
parseFileDescriptor(descFilePath).getMessageTypes.asScala.find { desc =>
-  desc.getName == messageName || desc.getFullName == messageName
+val descriptorList = parseFileDescriptor(descFilePath).map(fileDescriptor 
=> {
+  fileDescriptor.getMessageTypes.asScala.find { desc =>
+desc.getName == messageName || desc.getFullName == messageName
+  }
+}).filter(f => !f.isEmpty)
+
+if (descriptorList.isEmpty) {
+  throw 
QueryCompilationErrors.noProtobufMessageTypeReturnError(messageName)
 }
 
-descriptor match {
+descriptorList.last match {

Review Comment:
   Could you add a comment on why we are picking the last one? Will be useful 
for future readers as well.



##
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/utils/ProtobufUtils.scala:
##
@@ -177,47 +175,75 @@ private[sql] object ProtobufUtils extends Logging {
   .asInstanceOf[Descriptor]
   }
 
+  // TODO: Revisit to ensure that messageName is searched through all imports
   def buildDescriptor(descFilePath: String, messageName: String): Descriptor = 
{
-val descriptor = 
parseFileDescriptor(descFilePath).getMessageTypes.asScala.find { desc =>
-  desc.getName == messageName || desc.getFullName == messageName
+val descriptorList = parseFileDescriptor(descFilePath).map(fileDescriptor 
=> {

Review Comment:
   Style: use `find()` rather than map().filter(). 
   
   (you can use `findLast()` if there is a reason to use the last match). 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia opened a new pull request, #38460: [SPARK-40981][CONNECT][PYTHON] Support session.range in Python client

2022-10-31 Thread GitBox


amaliujia opened a new pull request, #38460:
URL: https://github.com/apache/spark/pull/38460

   
   
   ### What changes were proposed in this pull request?
   
   This PR adds `range` API to Python client's `RemoteSparkSession` with tests.
   
   ### Why are the changes needed?
   
   Improve API coverage.
   
   ### Does this PR introduce _any_ user-facing change?
   
   NO
   
   ### How was this patch tested?
   
   UT


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia opened a new pull request, #38459: [SPARK-40980][CONNECT] Support session.sql in Connect DSL

2022-10-31 Thread GitBox


amaliujia opened a new pull request, #38459:
URL: https://github.com/apache/spark/pull/38459

   
   
   ### What changes were proposed in this pull request?
   
   This PR adds `def sql(sqlText: String)` to DSL and test the SQL proto with 
SparkSession/DataFrame.
   
   ### Why are the changes needed?
   
   Improve testing coverage.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   UT


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] pan3793 opened a new pull request, #38458: [SPARK-40983][DOC] Remove Hadoop requirements for zstd mentioned in Parquet compression codec

2022-10-31 Thread GitBox


pan3793 opened a new pull request, #38458:
URL: https://github.com/apache/spark/pull/38458

   
   
   ### What changes were proposed in this pull request?
   
   Change the doc to remove Hadoop requirements for zstd mentioned in Parquet 
compression codec.
   
   ### Why are the changes needed?
   
   This requirement is removed after 
https://issues.apache.org/jira/browse/PARQUET-1866, and Spark uses Parquet 
1.12.3 now.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, doc updated.
   
   ### How was this patch tested?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] clairezhuang commented on pull request #38446: [Spark-40974]When the value of quote or escape exists in the content of csv file, the character in the csv file will be misidentified

2022-10-31 Thread GitBox


clairezhuang commented on PR #38446:
URL: https://github.com/apache/spark/pull/38446#issuecomment-1297983951

   Sorry for using the batch-3.3 as the mistake. The issue exists in the master 
branch, and I have no experience to make file change.
   I have closed the pull request. You could delete the pull request if it is 
possible.
   I would go to jira: 
https://issues.apache.org/jira/browse/SPARK-40982?jql=project%20%3D%20SPARK .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] clairezhuang closed pull request #38446: [Spark-40974]When the value of quote or escape exists in the content of csv file, the character in the csv file will be misidentified

2022-10-31 Thread GitBox


clairezhuang closed pull request #38446: [Spark-40974]When the value of quote 
or escape exists in the content of csv file, the character in the csv file will 
be misidentified
URL: https://github.com/apache/spark/pull/38446


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark-docker] dcoliversun commented on pull request #22: [SPARK-40969] Replace spark TGZ url with apache archive url

2022-10-31 Thread GitBox


dcoliversun commented on PR #22:
URL: https://github.com/apache/spark-docker/pull/22#issuecomment-1297973951

   Thanks for your review @HyukjinKwon @Yikun @zhengruifeng 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark-docker] Yikun commented on pull request #21: [SPARK-40569] Add smoke test in standalone cluster for spark-docker

2022-10-31 Thread GitBox


Yikun commented on PR #21:
URL: https://github.com/apache/spark-docker/pull/21#issuecomment-1297967368

   Mind to rebase now? Thanks @dcoliversun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark-docker] Yikun commented on pull request #22: [SPARK-40969] Replace spark TGZ url with apache archive url

2022-10-31 Thread GitBox


Yikun commented on PR #22:
URL: https://github.com/apache/spark-docker/pull/22#issuecomment-1297966991

   @HyukjinKwon @zhengruifeng @dcoliversun Thanks! Merged .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark-docker] Yikun closed pull request #22: [SPARK-40969] Replace spark TGZ url with apache archive url

2022-10-31 Thread GitBox


Yikun closed pull request #22: [SPARK-40969] Replace spark TGZ url with apache 
archive url
URL: https://github.com/apache/spark-docker/pull/22


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] JiexingLi commented on a diff in pull request #38371: [SPARK-40968] Fix a few wrong/misleading comments in DAGSchedulerSuite

2022-10-31 Thread GitBox


JiexingLi commented on code in PR #38371:
URL: https://github.com/apache/spark/pull/38371#discussion_r1010030101


##
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
##
@@ -3089,13 +3089,14 @@ class DAGSchedulerSuite extends SparkFunSuite with 
TempLocalSparkContext with Ti
 submit(finalRdd, Array(0, 1), properties = new Properties())
 
 // Finish the first 2 shuffle map stages.
-completeShuffleMapStageSuccessfully(0, 0, 2)
+completeShuffleMapStageSuccessfully(0, 0, 2, Seq("hostA", "hostB"))
 assert(mapOutputTracker.findMissingPartitions(shuffleId1) === 
Some(Seq.empty))
 
 completeShuffleMapStageSuccessfully(1, 0, 2, Seq("hostB", "hostD"))
 assert(mapOutputTracker.findMissingPartitions(shuffleId2) === 
Some(Seq.empty))
 
-// Executor lost on hostB, both of stage 0 and 1 should be reran.
+// FetchFailed on stage 2, both of stage 1 and 2 should be reran. Besides, 
executor lost on
+// hostB, both of stage 0 and 1 should be reran.

Review Comment:
   Thanks, Mridul. 
   
   I updated the comment. Your rewrite is very neat and clear. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] JiexingLi commented on a diff in pull request #38371: [SPARK-40968] Fix a few wrong/misleading comments in DAGSchedulerSuite

2022-10-31 Thread GitBox


JiexingLi commented on code in PR #38371:
URL: https://github.com/apache/spark/pull/38371#discussion_r1010029712


##
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
##
@@ -3089,13 +3089,14 @@ class DAGSchedulerSuite extends SparkFunSuite with 
TempLocalSparkContext with Ti
 submit(finalRdd, Array(0, 1), properties = new Properties())
 
 // Finish the first 2 shuffle map stages.
-completeShuffleMapStageSuccessfully(0, 0, 2)
+completeShuffleMapStageSuccessfully(0, 0, 2, Seq("hostA", "hostB"))

Review Comment:
   Sounds good. Deleted the hostNames param.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang opened a new pull request, #38457: [SPARK-40371][SQL] Migrate type check failures of NthValue and NTile onto error classes

2022-10-31 Thread GitBox


LuciferYang opened a new pull request, #38457:
URL: https://github.com/apache/spark/pull/38457

   ### What changes were proposed in this pull request?
   This pr aims to replace TypeCheckFailure by DataTypeMismatch in type checks 
in window expressions, includes `NthValue` and `NTile` 
   
   
   ### Why are the changes needed?
   Migration onto error classes unifies Spark SQL error messages.
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. The PR changes user-facing error messages.
   
   
   
   ### How was this patch tested?
   Pass GitHub Actions
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang closed pull request #38456: Spark 40950

2022-10-31 Thread GitBox


LuciferYang closed pull request #38456: Spark 40950
URL: https://github.com/apache/spark/pull/38456


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang opened a new pull request, #38456: Spark 40950

2022-10-31 Thread GitBox


LuciferYang opened a new pull request, #38456:
URL: https://github.com/apache/spark/pull/38456

   ### What changes were proposed in this pull request?
   This pr aims to replace TypeCheckFailure by DataTypeMismatch in type checks 
in window expressions, includes `NthValue` and `NTile` 
   
   
   ### Why are the changes needed?
   Migration onto error classes unifies Spark SQL error messages.
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. The PR changes user-facing error messages.
   
   
   
   ### How was this patch tested?
   Pass GitHub Actions
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38395:
URL: https://github.com/apache/spark/pull/38395#discussion_r1010024429


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveStatsFunctions.scala:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.SUMMARY
+import org.apache.spark.sql.types._
+
+/**
+ * Resolve StatsFunctions.
+ */
+object ResolveStatsFunctions extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning(
+_.containsPattern(SUMMARY), ruleId) {
+
+case s @ UnresolvedSummary(child, statistics) if s.childrenResolved =>
+  val percentiles = statistics.filter(p => p.endsWith("%"))
+.map(p => p.stripSuffix("%").toDouble / 100.0)
+
+  var mapExprs = Seq.empty[NamedExpression]
+  child.output.foreach { attr =>
+if (attr.dataType.isInstanceOf[NumericType] || 
attr.dataType.isInstanceOf[StringType]) {
+  val name = attr.name
+  val casted: Expression = attr.dataType match {
+case StringType => Cast(attr, DoubleType, evalMode = EvalMode.TRY)
+case _ => attr
+  }
+
+  val approxPercentile = if (percentiles.nonEmpty) {
+Alias(
+  new ApproximatePercentile(
+casted,
+Literal(percentiles.toArray),
+Literal(ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY)
+  ).toAggregateExpression(),
+  s"__${name}_approx_percentile__")()
+  } else null

Review Comment:
   Yeah looks this `val` is passed to `new Get` and there is no defensive check 
over `null` there. If there is ever a `null` case that appears, it will hit NPE.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38395:
URL: https://github.com/apache/spark/pull/38395#discussion_r1010010780


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveStatsFunctions.scala:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.SUMMARY
+import org.apache.spark.sql.types._
+
+/**
+ * Resolve StatsFunctions.
+ */
+object ResolveStatsFunctions extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning(
+_.containsPattern(SUMMARY), ruleId) {
+
+case s @ UnresolvedSummary(child, statistics) if s.childrenResolved =>
+  val percentiles = statistics.filter(p => p.endsWith("%"))
+.map(p => p.stripSuffix("%").toDouble / 100.0)
+
+  var mapExprs = Seq.empty[NamedExpression]
+  child.output.foreach { attr =>
+if (attr.dataType.isInstanceOf[NumericType] || 
attr.dataType.isInstanceOf[StringType]) {
+  val name = attr.name
+  val casted: Expression = attr.dataType match {
+case StringType => Cast(attr, DoubleType, evalMode = EvalMode.TRY)
+case _ => attr
+  }
+
+  val approxPercentile = if (percentiles.nonEmpty) {
+Alias(
+  new ApproximatePercentile(
+casted,
+Literal(percentiles.toArray),
+Literal(ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY)
+  ).toAggregateExpression(),
+  s"__${name}_approx_percentile__")()
+  } else null

Review Comment:
   I am seeing a potential issue of NPE. Is there a way that we get rid of 
`null` here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ulysses-you commented on a diff in pull request #38449: [SPARK-40798][SQL][FOLLOW-UP] Fix ansi mode in v2 ALTER TABLE PARTITION

2022-10-31 Thread GitBox


ulysses-you commented on code in PR #38449:
URL: https://github.com/apache/spark/pull/38449#discussion_r1010021095


##
sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/AlterTableAddPartitionSuite.scala:
##
@@ -124,18 +123,4 @@ class AlterTableAddPartitionSuite
   checkPartitions(t, Map("id" -> "1"), Map("id" -> "2"))
 }
   }
-
-  test("SPARK-40798: Alter partition should verify partition value - legacy") {
-withNamespaceAndTable("ns", "tbl") { t =>
-  sql(s"CREATE TABLE $t (c int) $defaultUsing PARTITIONED BY (p int)")
-
-  withSQLConf(
-  SQLConf.SKIP_TYPE_VALIDATION_ON_ALTER_PARTITION.key -> "true",
-  SQLConf.ANSI_ENABLED.key -> "false") {

Review Comment:
   Before I thought the `storeAssignmentPolicy` is enough so this test is 
corvered by `AlterTableAddPartitionSuiteBase`. Just change to legacy to restroe 
previous behavior.
   
   Now I add one more `SKIP_TYPE_VALIDATION_ON_ALTER_PARTITION` check and keep 
this test.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on pull request #38455: [SPARK-40827][PS][TESTS] Re-enable the DataFrame.corrwith test after fixing in future pandas.

2022-10-31 Thread GitBox


itholic commented on PR #38455:
URL: https://github.com/apache/spark/pull/38455#issuecomment-1297940463

   Thanks for the review :-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #38415: [SPARK-40938][CONNECT] Support Alias for every type of Relation

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38415:
URL: https://github.com/apache/spark/pull/38415#discussion_r1010018988


##
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##
@@ -213,3 +213,13 @@ message Sample {
 int64 seed = 1;
   }
 }
+
+// Relation alias.
+message SubqueryAlias {
+  // Required. The input relation.
+  Relation input = 1;
+  // Required. The alias.

Review Comment:
   Given our discussion on the protocol, this is a required field so we ask 
clients to always set it. Server side only fetch the value in this field not 
matter what it is (either `""`, or not)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #38393: [SPARK-40915][CONNECT] Improve `on` in Join in Python client

2022-10-31 Thread GitBox


HyukjinKwon closed pull request #38393: [SPARK-40915][CONNECT] Improve `on` in 
Join in Python client
URL: https://github.com/apache/spark/pull/38393


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #38393: [SPARK-40915][CONNECT] Improve `on` in Join in Python client

2022-10-31 Thread GitBox


HyukjinKwon commented on PR #38393:
URL: https://github.com/apache/spark/pull/38393#issuecomment-1297935788

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38395:
URL: https://github.com/apache/spark/pull/38395#discussion_r1010014282


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##
@@ -2101,3 +2101,53 @@ object AsOfJoin {
 }
   }
 }
+
+
+/**
+ * A logical plan for summary.
+ */
+case class UnresolvedSummary(
+child: LogicalPlan,
+statistics: Seq[String]) extends UnaryNode {
+
+  private lazy val supported =
+Set("count", "count_distinct", "approx_count_distinct", "mean", "stddev", 
"min", "max")
+
+  {
+// TODO: throw AnalysisException instead
+require(statistics.nonEmpty)

Review Comment:
   One example of the compilation error is 
   ```
   throw QueryCompilationErrors.windowSpecificationNotDefinedError(windowName))
   ```
   
   Maybe there is a need to define a new error class.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] panbingkun commented on a diff in pull request #38438: [SPARK-40748][SQL] Migrate type check failures of conditions onto error classes

2022-10-31 Thread GitBox


panbingkun commented on code in PR #38438:
URL: https://github.com/apache/spark/pull/38438#discussion_r1010013950


##
sql/core/src/test/java/test/org/apache/spark/sql/JavaColumnExpressionSuite.java:
##
@@ -81,10 +81,13 @@ public void isInCollectionCheckExceptionMessage() {
 Dataset df = spark.createDataFrame(rows, schema);
 Exception e = Assert.assertThrows(Exception.class,
   () -> df.filter(df.col("a").isInCollection(Arrays.asList(new 
Column("b");
-Arrays.asList("cannot resolve",
-  "due to data type mismatch: Arguments must be same type but were")
-.forEach(s ->
-  Assert.assertTrue(e.getMessage().toLowerCase(Locale.ROOT)
-.contains(s.toLowerCase(Locale.ROOT;
+System.out.println(e.getMessage().toLowerCase(Locale.ROOT));
+Arrays.asList(
+  "datatype_mismatch.data_diff_types",
+  "cannot resolve \"(a in (b))\"",
+  "due to data type mismatch: input to `in` should all be the same type, " 
+
+"but it's [\"int\", \"array\"].").forEach(s ->
+  Assert.assertTrue(e.getMessage().toLowerCase(Locale.ROOT)
+.contains(s.toLowerCase(Locale.ROOT;

Review Comment:
   I will check:
   1.Exception Type -> AnalysisException
   2.Error classes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38395:
URL: https://github.com/apache/spark/pull/38395#discussion_r1010010032


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##
@@ -2101,3 +2101,53 @@ object AsOfJoin {
 }
   }
 }
+
+
+/**
+ * A logical plan for summary.
+ */
+case class UnresolvedSummary(
+child: LogicalPlan,
+statistics: Seq[String]) extends UnaryNode {
+
+  private lazy val supported =
+Set("count", "count_distinct", "approx_count_distinct", "mean", "stddev", 
"min", "max")
+
+  {
+// TODO: throw AnalysisException instead
+require(statistics.nonEmpty)

Review Comment:
   I am thinking this is easy to handle in this PR:
   
   either we return the original plan in `ResolveStatsFunctions` when we find 
`UnresolvedSummary` has empty `statistics` thus the plan will remain 
`unresolved` and will fail at last, or we through the compilation exception in 
`ResolveStatsFunctions` when seeing empty `statistics`.



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##
@@ -2101,3 +2101,53 @@ object AsOfJoin {
 }
   }
 }
+
+
+/**
+ * A logical plan for summary.
+ */
+case class UnresolvedSummary(
+child: LogicalPlan,
+statistics: Seq[String]) extends UnaryNode {
+
+  private lazy val supported =
+Set("count", "count_distinct", "approx_count_distinct", "mean", "stddev", 
"min", "max")
+
+  {
+// TODO: throw AnalysisException instead
+require(statistics.nonEmpty)

Review Comment:
   I am thinking this is easy to handle in this PR:
   
   either we return the original plan in `ResolveStatsFunctions` when we find 
`UnresolvedSummary` has empty `statistics` thus the plan will remain 
`unresolved` and will fail at last, or we throw the compilation exception in 
`ResolveStatsFunctions` when seeing empty `statistics`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] swamirishi commented on pull request #38377: [SPARK-40901][CORE] Unable to store Spark Driver logs with Absolute Hadoop based URI FS Path

2022-10-31 Thread GitBox


swamirishi commented on PR #38377:
URL: https://github.com/apache/spark/pull/38377#issuecomment-1297930510

   > QQ: Any details for the query 
[above](https://github.com/apache/spark/pull/38377#pullrequestreview-1153485176)
 ?
   
   Currently all of the logs can be only written to the default filesytem 
defined by fs.Defaultfs job config. One wouldn't be able to write the logs to 
other filesystem directly without changing the property as absolute urls are 
not supported. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] panbingkun commented on a diff in pull request #38438: [SPARK-40748][SQL] Migrate type check failures of conditions onto error classes

2022-10-31 Thread GitBox


panbingkun commented on code in PR #38438:
URL: https://github.com/apache/spark/pull/38438#discussion_r1010013499


##
sql/core/src/test/java/test/org/apache/spark/sql/JavaColumnExpressionSuite.java:
##
@@ -81,10 +81,13 @@ public void isInCollectionCheckExceptionMessage() {
 Dataset df = spark.createDataFrame(rows, schema);
 Exception e = Assert.assertThrows(Exception.class,
   () -> df.filter(df.col("a").isInCollection(Arrays.asList(new 
Column("b");
-Arrays.asList("cannot resolve",
-  "due to data type mismatch: Arguments must be same type but were")
-.forEach(s ->
-  Assert.assertTrue(e.getMessage().toLowerCase(Locale.ROOT)
-.contains(s.toLowerCase(Locale.ROOT;
+System.out.println(e.getMessage().toLowerCase(Locale.ROOT));

Review Comment:
   Sorry, I left the debugging code by mistake.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #38451: [SPARK-40976][BUILD] Upgrade sbt to 1.7.3

2022-10-31 Thread GitBox


LuciferYang commented on PR #38451:
URL: https://github.com/apache/spark/pull/38451#issuecomment-1297928916

   thanks @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

2022-10-31 Thread GitBox


cloud-fan commented on PR #38347:
URL: https://github.com/apache/spark/pull/38347#issuecomment-1297927909

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan closed pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

2022-10-31 Thread GitBox


cloud-fan closed pull request #38347: [SPARK-40883][CONNECT] Support Range in 
Connect proto
URL: https://github.com/apache/spark/pull/38347


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38395:
URL: https://github.com/apache/spark/pull/38395#discussion_r1010010032


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##
@@ -2101,3 +2101,53 @@ object AsOfJoin {
 }
   }
 }
+
+
+/**
+ * A logical plan for summary.
+ */
+case class UnresolvedSummary(
+child: LogicalPlan,
+statistics: Seq[String]) extends UnaryNode {
+
+  private lazy val supported =
+Set("count", "count_distinct", "approx_count_distinct", "mean", "stddev", 
"min", "max")
+
+  {
+// TODO: throw AnalysisException instead
+require(statistics.nonEmpty)

Review Comment:
   I am thinking this is easy to handle in this PR:
   
   either we return the plan in `ResolveStatsFunctions` when we find 
`UnresolvedSummary` has empty `statistics`, or we through the exception in 
`ResolveStatsFunctions` when seeing empty `statistics`.



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveStatsFunctions.scala:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.SUMMARY
+import org.apache.spark.sql.types._
+
+/**
+ * Resolve StatsFunctions.
+ */
+object ResolveStatsFunctions extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning(
+_.containsPattern(SUMMARY), ruleId) {
+
+case s @ UnresolvedSummary(child, statistics) if s.childrenResolved =>
+  val percentiles = statistics.filter(p => p.endsWith("%"))
+.map(p => p.stripSuffix("%").toDouble / 100.0)
+
+  var mapExprs = Seq.empty[NamedExpression]
+  child.output.foreach { attr =>
+if (attr.dataType.isInstanceOf[NumericType] || 
attr.dataType.isInstanceOf[StringType]) {
+  val name = attr.name
+  val casted: Expression = attr.dataType match {
+case StringType => Cast(attr, DoubleType, evalMode = EvalMode.TRY)
+case _ => attr
+  }
+
+  val approxPercentile = if (percentiles.nonEmpty) {
+Alias(
+  new ApproximatePercentile(
+casted,
+Literal(percentiles.toArray),
+Literal(ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY)
+  ).toAggregateExpression(),
+  s"__${name}_approx_percentile__")()
+  } else null

Review Comment:
   I am see a potential issue of NPE. Is there a way that we get rid of `null` 
here?



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##
@@ -2101,3 +2101,53 @@ object AsOfJoin {
 }
   }
 }
+
+
+/**
+ * A logical plan for summary.
+ */
+case class UnresolvedSummary(
+child: LogicalPlan,
+statistics: Seq[String]) extends UnaryNode {
+
+  private lazy val supported =
+Set("count", "count_distinct", "approx_count_distinct", "mean", "stddev", 
"min", "max")
+
+  {
+// TODO: throw AnalysisException instead
+require(statistics.nonEmpty)
+val percentiles = statistics.filter(p => p.endsWith("%")).map { p =>
+  try {
+p.stripSuffix("%").toDouble / 100.0
+  } catch {
+case e: NumberFormatException =>
+  throw QueryExecutionErrors.cannotParseStatisticAsPercentileError(p, 
e)
+  }
+}
+require(percentiles.forall(p => p >= 0 && p <= 1), "Percentiles must be in 
the range [0, 1]")
+
+statistics.foreach {
+  case s if supported.contains(s) =>
+  case p if p.endsWith("%") =>
+  case s => throw QueryExecutionErrors.statisticNotRecognizedError(s)
+}
+  }
+
+  override protected def stringArgs: Iterator[Any] = super.stringArgs.take(5)
+
+  override lazy val resolved = false // Summary will be replaced after being 
resolved.
+
+  final override val nodePatterns: Seq[TreePattern] = Seq(SUMMARY)
+
+  override def output: Seq[Attribute] = {
+  

[GitHub] [spark] panbingkun commented on pull request #38439: [SPARK-40890][SQL][TESTS] Check error classes in DataSourceV2SQLSuite

2022-10-31 Thread GitBox


panbingkun commented on PR #38439:
URL: https://github.com/apache/spark/pull/38439#issuecomment-1297915406

   > It seems this is related to your changes:
   > 
   > ```
   > DataSourceV2SQLSuiteV1Filter.MERGE INTO TABLE
   > org.scalatest.exceptions.TestFailedException: "_LEGACY_ERROR_TEMP_2309" 
did not equal null
   > ```
   > 
   > Most likely after 
[3c967f0](https://github.com/apache/spark/commit/3c967f06e6c37360e53f9c7a5ecff95ae818e713),
 need to merge/rebase the master.
   
   Done, waiting for CI.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #38371: [SPARK-40968] Fix a few wrong/misleading comments in DAGSchedulerSuite

2022-10-31 Thread GitBox


mridulm commented on code in PR #38371:
URL: https://github.com/apache/spark/pull/38371#discussion_r1010007062


##
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
##
@@ -3089,13 +3089,14 @@ class DAGSchedulerSuite extends SparkFunSuite with 
TempLocalSparkContext with Ti
 submit(finalRdd, Array(0, 1), properties = new Properties())
 
 // Finish the first 2 shuffle map stages.
-completeShuffleMapStageSuccessfully(0, 0, 2)
+completeShuffleMapStageSuccessfully(0, 0, 2, Seq("hostA", "hostB"))

Review Comment:
   Enhancing documentation of `completeShuffleMapStageSuccessfully` is 
sufficient - we dont want to explicitly specify `hostNames` for all invocations 
of the method.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #38420: [SPARK-40947][SPARK-40966][PS][INFRA][TEST] Upgrade pandas to 1.5.1

2022-10-31 Thread GitBox


zhengruifeng commented on PR #38420:
URL: https://github.com/apache/spark/pull/38420#issuecomment-1297914857

   late lgtm , thanks all


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic opened a new pull request, #38455: [SPARK-40827][PS][TEST] Re-enable the DataFrame.corrwith test after fixing in future pandas.

2022-10-31 Thread GitBox


itholic opened a new pull request, #38455:
URL: https://github.com/apache/spark/pull/38455

   ### What changes were proposed in this pull request?
   
   This PR proposes to make the manual tests for `DataFrame.corrwith` back into 
formal approach, if the pandas version is not 1.5.0.
   
   ### Why are the changes needed?
   
   There was a regression introduced by pandas 1.5.0 
(https://github.com/pandas-dev/pandas/issues/48826), and seems it's resolved 
now.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No.
   
   ### How was this patch tested?
   
   
   The fixed test should pass the CI.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch

2022-10-31 Thread GitBox


HyukjinKwon commented on code in PR #38223:
URL: https://github.com/apache/spark/pull/38223#discussion_r1010003793


##
python/pyspark/worker.py:
##
@@ -146,7 +146,74 @@ def verify_result_type(result):
 )
 
 
-def wrap_cogrouped_map_pandas_udf(f, return_type, argspec):
+def verify_pandas_result(result, return_type, assign_cols_by_name):
+import pandas as pd
+
+if not isinstance(result, pd.DataFrame):
+raise TypeError(
+"Return type of the user-defined function should be "
+"pandas.DataFrame, but is {}".format(type(result))
+)
+
+# check the schema of the result only if it is not empty or has columns
+if not result.empty or len(result.columns) != 0:
+# if any column name of the result is a string
+# the column names of the result have to match the return type
+#   see create_array in 
pyspark.sql.pandas.serializers.ArrowStreamPandasSerializer
+field_names = set([field.name for field in return_type.fields])
+column_names = set(result.columns)
+if (
+assign_cols_by_name
+and any(isinstance(name, str) for name in result.columns)
+and column_names != field_names
+):
+limit = 5

Review Comment:
   e.g., would have to consider having a config to control this too ..



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on pull request #38377: [SPARK-40901][CORE] Unable to store Spark Driver logs with Absolute Hadoop based URI FS Path

2022-10-31 Thread GitBox


mridulm commented on PR #38377:
URL: https://github.com/apache/spark/pull/38377#issuecomment-1297907890

   QQ: Any details for the query 
[above](https://github.com/apache/spark/pull/38377#pullrequestreview-1153485176)
 ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch

2022-10-31 Thread GitBox


HyukjinKwon commented on code in PR #38223:
URL: https://github.com/apache/spark/pull/38223#discussion_r1010003153


##
python/pyspark/worker.py:
##
@@ -146,7 +146,74 @@ def verify_result_type(result):
 )
 
 
-def wrap_cogrouped_map_pandas_udf(f, return_type, argspec):
+def verify_pandas_result(result, return_type, assign_cols_by_name):
+import pandas as pd
+
+if not isinstance(result, pd.DataFrame):
+raise TypeError(
+"Return type of the user-defined function should be "
+"pandas.DataFrame, but is {}".format(type(result))
+)
+
+# check the schema of the result only if it is not empty or has columns
+if not result.empty or len(result.columns) != 0:
+# if any column name of the result is a string
+# the column names of the result have to match the return type
+#   see create_array in 
pyspark.sql.pandas.serializers.ArrowStreamPandasSerializer
+field_names = set([field.name for field in return_type.fields])
+column_names = set(result.columns)
+if (
+assign_cols_by_name
+and any(isinstance(name, str) for name in result.columns)
+and column_names != field_names
+):
+limit = 5

Review Comment:
   Hm, I wonder if this abbreviation is too much (and the character limit) ... 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch

2022-10-31 Thread GitBox


HyukjinKwon commented on code in PR #38223:
URL: https://github.com/apache/spark/pull/38223#discussion_r1010002613


##
python/pyspark/sql/tests/pandas/test_pandas_cogrouped_map.py:
##
@@ -165,100 +148,191 @@ def merge_pandas(lft, _):
 )
 
 def test_apply_in_pandas_not_returning_pandas_dataframe(self):
-left = self.data1
-right = self.data2
+self._test_merge_error(
+fn=lambda lft, rgt: lft.size + rgt.size,
+error_class=PythonException,
+error_message_regex="Return type of the user-defined function "
+"should be pandas.DataFrame, but is ",
+)
+
+def test_apply_in_pandas_returning_column_names(self):
+self._test_merge(fn=lambda lft, rgt: pd.merge(lft, rgt, on=["id", 
"k"]))
 
+def test_apply_in_pandas_returning_no_column_names(self):
 def merge_pandas(lft, rgt):
-return lft.size + rgt.size
+res = pd.merge(lft, rgt, on=["id", "k"])
+res.columns = range(res.columns.size)
+return res
 
-with QuietTest(self.sc):
-with self.assertRaisesRegex(
-PythonException,
-"Return type of the user-defined function should be 
pandas.DataFrame, "
-"but is ",
-):
-(
-left.groupby("id")
-.cogroup(right.groupby("id"))
-.applyInPandas(merge_pandas, "id long, k int, v int, v2 
int")
-.collect()
-)
+self._test_merge(fn=merge_pandas)
 
-def test_apply_in_pandas_returning_wrong_number_of_columns(self):
-left = self.data1
-right = self.data2
+def test_apply_in_pandas_returning_column_names_sometimes(self):
+def merge_pandas(lft, rgt):
+res = pd.merge(lft, rgt, on=["id", "k"])
+if 0 in lft["id"] and lft["id"][0] % 2 == 0:
+return res
+res.columns = range(res.columns.size)
+return res
+
+self._test_merge(fn=merge_pandas)
 
+def test_apply_in_pandas_returning_wrong_column_names(self):
 def merge_pandas(lft, rgt):
 if 0 in lft["id"] and lft["id"][0] % 2 == 0:
 lft["add"] = 0
 if 0 in rgt["id"] and rgt["id"][0] % 3 == 0:
 rgt["more"] = 1
 return pd.merge(lft, rgt, on=["id", "k"])
 
-with QuietTest(self.sc):
-with self.assertRaisesRegex(
-PythonException,
-"Number of columns of the returned pandas.DataFrame "
-"doesn't match specified schema. Expected: 4 Actual: 6",
-):
-(
-# merge_pandas returns two columns for even keys while we 
set schema to four
-left.groupby("id")
-.cogroup(right.groupby("id"))
-.applyInPandas(merge_pandas, "id long, k int, v int, v2 
int")
-.collect()
-)
+self._test_merge_error(
+fn=merge_pandas,
+error_class=PythonException,
+error_message_regex="Column names of the returned pandas.DataFrame 
"
+"do not match specified schema.  Unexpected: add, more  Schema: 
id, k, v, v2\n",
+)
 
-def test_apply_in_pandas_returning_empty_dataframe(self):
-left = self.data1
-right = self.data2
+# with very large schema, missing and unexpected is limited to 5
+# and the schema is abbreviated in the error message
+schema = "id long, k long, mean double, " + ", ".join(
+f"column_with_long_column_name_{no} integer" for no in range(35)
+)
+self._test_merge_error(
+fn=lambda lft, rgt: pd.DataFrame(
+[
+(
+lft.id,
+lft.k,
+lft.v.mean(),
+)
++ tuple(lft.v.mean() for _ in range(7))
+],
+columns=["id", "k", "mean"] + [f"extra_column_{no} integer" 
for no in range(7)],
+),
+output_schema=schema,
+error_class=PythonException,
+error_message_regex="Column names of the returned 
pandas\\.DataFrame "
+"do not match specified schema\\.  "
+"Missing \\(first 5 of 35\\): column_with_long_column_name_0,"
+" column_with_long_column_name_1, column_with_long_column_name_10,"
+" column_with_long_column_name_11, column_with_long_column_name_12 
 "
+"Unexpected \\(first 5 of 7\\): extra_column_0 integer, 
extra_column_1 integer,"
+" extra_column_2 integer, extra_column_3 integer, extra_column_4 
integer  "
+"Schema: id, k, mean, column_with_long_column_name_0, 
column_with_long_column_name_1,"
+" 

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch

2022-10-31 Thread GitBox


HyukjinKwon commented on code in PR #38223:
URL: https://github.com/apache/spark/pull/38223#discussion_r1010002388


##
python/pyspark/sql/tests/pandas/test_pandas_cogrouped_map.py:
##
@@ -165,100 +148,191 @@ def merge_pandas(lft, _):
 )
 
 def test_apply_in_pandas_not_returning_pandas_dataframe(self):
-left = self.data1
-right = self.data2
+self._test_merge_error(
+fn=lambda lft, rgt: lft.size + rgt.size,
+error_class=PythonException,
+error_message_regex="Return type of the user-defined function "
+"should be pandas.DataFrame, but is ",
+)
+
+def test_apply_in_pandas_returning_column_names(self):
+self._test_merge(fn=lambda lft, rgt: pd.merge(lft, rgt, on=["id", 
"k"]))
 
+def test_apply_in_pandas_returning_no_column_names(self):
 def merge_pandas(lft, rgt):
-return lft.size + rgt.size
+res = pd.merge(lft, rgt, on=["id", "k"])
+res.columns = range(res.columns.size)
+return res
 
-with QuietTest(self.sc):
-with self.assertRaisesRegex(
-PythonException,
-"Return type of the user-defined function should be 
pandas.DataFrame, "
-"but is ",
-):
-(
-left.groupby("id")
-.cogroup(right.groupby("id"))
-.applyInPandas(merge_pandas, "id long, k int, v int, v2 
int")
-.collect()
-)
+self._test_merge(fn=merge_pandas)
 
-def test_apply_in_pandas_returning_wrong_number_of_columns(self):
-left = self.data1
-right = self.data2
+def test_apply_in_pandas_returning_column_names_sometimes(self):
+def merge_pandas(lft, rgt):
+res = pd.merge(lft, rgt, on=["id", "k"])
+if 0 in lft["id"] and lft["id"][0] % 2 == 0:
+return res
+res.columns = range(res.columns.size)
+return res
+
+self._test_merge(fn=merge_pandas)
 
+def test_apply_in_pandas_returning_wrong_column_names(self):
 def merge_pandas(lft, rgt):
 if 0 in lft["id"] and lft["id"][0] % 2 == 0:
 lft["add"] = 0
 if 0 in rgt["id"] and rgt["id"][0] % 3 == 0:
 rgt["more"] = 1
 return pd.merge(lft, rgt, on=["id", "k"])
 
-with QuietTest(self.sc):
-with self.assertRaisesRegex(
-PythonException,
-"Number of columns of the returned pandas.DataFrame "
-"doesn't match specified schema. Expected: 4 Actual: 6",
-):
-(
-# merge_pandas returns two columns for even keys while we 
set schema to four
-left.groupby("id")
-.cogroup(right.groupby("id"))
-.applyInPandas(merge_pandas, "id long, k int, v int, v2 
int")
-.collect()
-)
+self._test_merge_error(
+fn=merge_pandas,
+error_class=PythonException,
+error_message_regex="Column names of the returned pandas.DataFrame 
"
+"do not match specified schema.  Unexpected: add, more  Schema: 
id, k, v, v2\n",
+)
 
-def test_apply_in_pandas_returning_empty_dataframe(self):
-left = self.data1
-right = self.data2
+# with very large schema, missing and unexpected is limited to 5
+# and the schema is abbreviated in the error message
+schema = "id long, k long, mean double, " + ", ".join(
+f"column_with_long_column_name_{no} integer" for no in range(35)
+)
+self._test_merge_error(
+fn=lambda lft, rgt: pd.DataFrame(
+[
+(
+lft.id,
+lft.k,
+lft.v.mean(),
+)
++ tuple(lft.v.mean() for _ in range(7))
+],
+columns=["id", "k", "mean"] + [f"extra_column_{no} integer" 
for no in range(7)],
+),
+output_schema=schema,
+error_class=PythonException,
+error_message_regex="Column names of the returned 
pandas\\.DataFrame "
+"do not match specified schema\\.  "
+"Missing \\(first 5 of 35\\): column_with_long_column_name_0,"
+" column_with_long_column_name_1, column_with_long_column_name_10,"
+" column_with_long_column_name_11, column_with_long_column_name_12 
 "
+"Unexpected \\(first 5 of 7\\): extra_column_0 integer, 
extra_column_1 integer,"
+" extra_column_2 integer, extra_column_3 integer, extra_column_4 
integer  "
+"Schema: id, k, mean, column_with_long_column_name_0, 
column_with_long_column_name_1,"
+" 

[GitHub] [spark] zhengruifeng commented on pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`

2022-10-31 Thread GitBox


zhengruifeng commented on PR #38395:
URL: https://github.com/apache/spark/pull/38395#issuecomment-1297904765

   @cloud-fan @HyukjinKwon would you mind taking another look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch

2022-10-31 Thread GitBox


HyukjinKwon commented on PR #38223:
URL: https://github.com/apache/spark/pull/38223#issuecomment-1297904278

   Since this touches the core Python worker side, let's make sure to get 
reviewed from other people like @BryanCutler @viirya @ueshin too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #38442: [SPARK-40953][CONNECT][PYTHON][TESTS][FOLLOW-UP] Add a test case for head(n)

2022-10-31 Thread GitBox


zhengruifeng commented on PR #38442:
URL: https://github.com/apache/spark/pull/38442#issuecomment-1297902530

   late lgtm


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #38433: [SPARK-40943][SQL] Make the MSCK keyword optional in MSCK REPAIR TABLE commands

2022-10-31 Thread GitBox


cloud-fan commented on code in PR #38433:
URL: https://github.com/apache/spark/pull/38433#discussion_r1009998588


##
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##
@@ -1728,6 +1728,14 @@ class DataSourceV2SQLSuiteV1Filter extends 
DataSourceV2SQLSuite with AlterTableT
 }
   }
 
+  test("REPAIR TABLE omit MSCK") {

Review Comment:
   oh this is not parser test. But I think a parser change only needs parser 
test, no end-to-end test.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #38433: [SPARK-40943][SQL] Make the MSCK keyword optional in MSCK REPAIR TABLE commands

2022-10-31 Thread GitBox


cloud-fan commented on code in PR #38433:
URL: https://github.com/apache/spark/pull/38433#discussion_r1009998364


##
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala:
##
@@ -1728,6 +1728,14 @@ class DataSourceV2SQLSuiteV1Filter extends 
DataSourceV2SQLSuite with AlterTableT
 }
   }
 
+  test("REPAIR TABLE omit MSCK") {

Review Comment:
   can we move the test to `MsckRepairTableParserSuite`? And probably rename 
the test suite. And also `MsckRepairTableSuiteBase` and sub classes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on pull request #38433: [SPARK-40943][SQL] Make the MSCK keyword optional in MSCK REPAIR TABLE commands

2022-10-31 Thread GitBox


cloud-fan commented on PR #38433:
URL: https://github.com/apache/spark/pull/38433#issuecomment-1297895255

   Making `MSCK` optional makes our SQL syntax more flexible (and simpler), and 
compatible with MySQL. And there seems no down side. @dongjoon-hyun Can you 
share your concerns? We can discuss more here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on pull request #38422: [SPARK-40948][SQL] Introduce new error class: PATH_NOT_FOUND

2022-10-31 Thread GitBox


itholic commented on PR #38422:
URL: https://github.com/apache/spark/pull/38422#issuecomment-1297892959

   The latest CI failure is not reproduced on my test env.
   Let me just try to rebase with master branch.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38449: [SPARK-40798][SQL][FOLLOW-UP] Fix ansi mode in v2 ALTER TABLE PARTITION

2022-10-31 Thread GitBox


HyukjinKwon commented on code in PR #38449:
URL: https://github.com/apache/spark/pull/38449#discussion_r1009993046


##
sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/AlterTableAddPartitionSuite.scala:
##
@@ -124,18 +123,4 @@ class AlterTableAddPartitionSuite
   checkPartitions(t, Map("id" -> "1"), Map("id" -> "2"))
 }
   }
-
-  test("SPARK-40798: Alter partition should verify partition value - legacy") {
-withNamespaceAndTable("ns", "tbl") { t =>
-  sql(s"CREATE TABLE $t (c int) $defaultUsing PARTITIONED BY (p int)")
-
-  withSQLConf(
-  SQLConf.SKIP_TYPE_VALIDATION_ON_ALTER_PARTITION.key -> "true",
-  SQLConf.ANSI_ENABLED.key -> "false") {

Review Comment:
   Shouldn't we maybe keep the test after removing `SQLConf.ANSI_ENABLED.key -> 
"false"`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] vinodkc commented on a diff in pull request #38263: [SPARK-40692][SQL] Support data masking built-in function 'mask_hash'

2022-10-31 Thread GitBox


vinodkc commented on code in PR #38263:
URL: https://github.com/apache/spark/pull/38263#discussion_r1009985691


##
sql/core/src/test/resources/sql-tests/inputs/string-functions.sql:
##
@@ -58,6 +58,14 @@ SELECT substring('Spark SQL' from 5);
 SELECT substring('Spark SQL' from -3);
 SELECT substring('Spark SQL' from 5 for 1);
 
+-- mask function
+select mask_hash('TestString-123'),

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] closed pull request #37251: [SPARK-39838][SQL] Preserve explicit empty column metadata

2022-10-31 Thread GitBox


github-actions[bot] closed pull request #37251: [SPARK-39838][SQL] Preserve 
explicit empty column metadata
URL: https://github.com/apache/spark/pull/37251


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang closed pull request #38418: [SPARK-40944][SQL] Relax ordering constraint for CREATE TABLE column options

2022-10-31 Thread GitBox


gengliangwang closed pull request #38418: [SPARK-40944][SQL] Relax ordering 
constraint for CREATE TABLE column options
URL: https://github.com/apache/spark/pull/38418


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on pull request #38418: [SPARK-40944][SQL] Relax ordering constraint for CREATE TABLE column options

2022-10-31 Thread GitBox


gengliangwang commented on PR #38418:
URL: https://github.com/apache/spark/pull/38418#issuecomment-1297829168

   Thanks, merging to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on pull request #38277: [SPARK-40815][SQL] Add `DelegateSymlinkTextInputFormat` to workaround `SymlinkTextInputSplit` bug

2022-10-31 Thread GitBox


sadikovi commented on PR #38277:
URL: https://github.com/apache/spark/pull/38277#issuecomment-1297826231

   Thank you @dongjoon-hyun for merging .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srielau commented on pull request #38433: [SPARK-40943][SQL] Make the MSCK keyword optional in MSCK REPAIR TABLE commands

2022-10-31 Thread GitBox


srielau commented on PR #38433:
URL: https://github.com/apache/spark/pull/38433#issuecomment-1297817456

   > May I ask if dropping `MSCK` is the official Databricks's direction in 
order to do the lead, @srielau ?
   > 
   > > As perhaps the most popular platform, we have the chance to lead here 
and make things easier.
   
   Yes, Databricks will make MSCK a noise word. I will probably remove it from 
our documented syntax and add a footnote such as "for compatibility with Hive 
you may also use MSCK REPAIR TABLE".


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #38277: [SPARK-40815][SQL] Add `DelegateSymlinkTextInputFormat` to workaround `SymlinkTextInputSplit` bug

2022-10-31 Thread GitBox


dongjoon-hyun commented on PR #38277:
URL: https://github.com/apache/spark/pull/38277#issuecomment-1297815756

   Thank you so much, @sadikovi , @sunchao , @mridulm .
   Merged to master for Apache Spark 3.4.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun closed pull request #38277: [SPARK-40815][SQL] Add `DelegateSymlinkTextInputFormat` to workaround `SymlinkTextInputSplit` bug

2022-10-31 Thread GitBox


dongjoon-hyun closed pull request #38277: [SPARK-40815][SQL] Add 
`DelegateSymlinkTextInputFormat` to workaround `SymlinkTextInputSplit` bug
URL: https://github.com/apache/spark/pull/38277


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on pull request #38277: [SPARK-40815][SQL] Add `DelegateSymlinkTextInputFormat` to workaround `SymlinkTextInputSplit` bug

2022-10-31 Thread GitBox


sadikovi commented on PR #38277:
URL: https://github.com/apache/spark/pull/38277#issuecomment-1297806098

   Thanks for updating the PR title .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #38433: [SPARK-40943][SQL] Make the MSCK keyword optional in MSCK REPAIR TABLE commands

2022-10-31 Thread GitBox


dongjoon-hyun commented on PR #38433:
URL: https://github.com/apache/spark/pull/38433#issuecomment-1297804976

   May I ask if dropping `MSCK` is the official Databricks's direction in order 
to do the lead, @srielau ?
   > As perhaps the most popular platform, we have the chance to lead here and 
make things easier.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srielau commented on pull request #38433: [SPARK-40943][SQL] Make the MSCK keyword optional in MSCK REPAIR TABLE commands

2022-10-31 Thread GitBox


srielau commented on PR #38433:
URL: https://github.com/apache/spark/pull/38433#issuecomment-1297801930

   > If MySQL is the only DBMS for this syntax, I'm negative about this change 
because this PR only increases the chance of SQL incompatibility. Is there any 
other support cases?
   
   IMHO mySQL was onto something to not include MSCK (who is the chicken and 
who the egg here anyway?).
   It seems like everyone (e.g. IBM) just copied Hive to not rock the boat.
   
   Interestingly our own docs omit MSCK from the title (and the file name):
   
https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-repair-table.html
   
   It seems the only purpose of this keyword is to trip people up.
   As perhaps the most popular platform, we have the chance to lead here and 
make things easier.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #38277: [SPARK-40815][SQL] Add `DelegateSymlinkTextInputFormat` to workaround `SymlinkTextInputSplit` bug

2022-10-31 Thread GitBox


dongjoon-hyun commented on PR #38277:
URL: https://github.com/apache/spark/pull/38277#issuecomment-1297752636

   I revise the PR title to simplify, @sadikovi . You can change it back if you 
want.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on a diff in pull request #38277: [SPARK-40815][SQL] Introduce DelegateSymlinkTextInputFormat to handle empty splits when "spark.hadoopRDD.ignoreEmptySplits" is ena

2022-10-31 Thread GitBox


sadikovi commented on code in PR #38277:
URL: https://github.com/apache/spark/pull/38277#discussion_r1009895816


##
sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/DelegateSymlinkTextInputFormat.java:
##
@@ -0,0 +1,122 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.io;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+
+import org.apache.hadoop.fs.ContentSummary;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapred.FileSplit;
+import org.apache.hadoop.mapred.InputSplit;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.RecordReader;
+import org.apache.hadoop.mapred.Reporter;
+
+/**
+ * Delegate for SymlinkTextInputFormat, created to address SPARK-40815.
+ * Fixes an issue where SymlinkTextInputFormat returns empty splits which 
could result in
+ * the correctness issue when "spark.hadoopRDD.ignoreEmptySplits" is enabled.
+ *

Review Comment:
   Updated.



##
sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/DelegateSymlinkTextInputFormat.java:
##
@@ -0,0 +1,122 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.io;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+
+import org.apache.hadoop.fs.ContentSummary;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapred.FileSplit;
+import org.apache.hadoop.mapred.InputSplit;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.RecordReader;
+import org.apache.hadoop.mapred.Reporter;
+
+/**
+ * Delegate for SymlinkTextInputFormat, created to address SPARK-40815.
+ * Fixes an issue where SymlinkTextInputFormat returns empty splits which 
could result in
+ * the correctness issue when "spark.hadoopRDD.ignoreEmptySplits" is enabled.
+ *
+ * In this class, we update the split start and length to match the target 
file input thus fixing
+ * the issue.
+ */
+@SuppressWarnings("deprecation")

Review Comment:
   Removed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #38441: [SPARK-40979][CORE] Keep removed executor info in decommission state

2022-10-31 Thread GitBox


AmplabJenkins commented on PR #38441:
URL: https://github.com/apache/spark/pull/38441#issuecomment-1297692337

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] MaxGekk opened a new pull request, #38454: [WIP][SPARK-40978][SQL] Migrate `failAnalysis()` onto error classes

2022-10-31 Thread GitBox


MaxGekk opened a new pull request, #38454:
URL: https://github.com/apache/spark/pull/38454

   ### What changes were proposed in this pull request?
   In the PR, I propose to migrate `failAnalysis()` errors without a context 
onto temporary error classes with the prefix `_LEGACY_ERROR_TEMP_24xx`. The 
error message will not include the error classes, so, in this way we will 
preserve the existing behaviour.
   
   ### Why are the changes needed?
   The migration on temporary error classes allows to gather statistics about 
errors and detect most popular error classes. After that we could prioritise 
the work on migration.
   
   The new error class name prefix `_LEGACY_ERROR_TEMP_` proposed here kind of 
marks the error as developer-facing, not user-facing. Developers can still get 
the error class programmatically via the `SparkThrowable` interface, so that 
they can build error infra with it. End users won't see the error class in the 
message. This allows us to do the error migration very quickly, and we can 
refine the error classes and mark them as user-facing later (naming them 
properly, adding tests, etc.).
   
   ### Does this PR introduce _any_ user-facing change?
   No. The error messages should be almost the same by default.
   
   ### How was this patch tested?
   By running the affected test suites:
   ```
   $ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly 
org.apache.spark.sql.SQLQueryTestSuite"
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] warrenzhu25 commented on pull request #38441: [SPARK-40979][CORE] Keep removed executor info in decommission state

2022-10-31 Thread GitBox


warrenzhu25 commented on PR #38441:
URL: https://github.com/apache/spark/pull/38441#issuecomment-1297682478

   > Could you explain how it can prevent your problem completely? That sounds 
still greedy and best-effort approach to mitigate your issue.
   > 
   > > What I mean is something like below, the memory added is constant 
regardless of cluster size. @dongjoon-hyun
   > > ```
   > > // limit the set size to 100 or 1000
   > > val executorsRemovedDueToDecom = new HashSet[String]
   > > 
   > > executorsPendingDecommission.remove(executorId)
   > > executorsRemovedDueToDecom.add(executorId)
   > > ```
   
   FetchFailed caused by decom executor could be divided into 2 categories:
   
   1. When FetchFailed reached DAGScheduler, the executor is still alive or is 
lost but the lost info hasn't reached TaskSchedulerImpl. This is already 
handled in [SPARK-40979](https://issues.apache.org/jira/browse/SPARK-40979)
   2. FetchFailed is caused by decom executor loss, so the decom info is 
already removed in TaskSchedulerImpl. If we keep such info in a short period, 
it is good enough. Even we limit the size of removed executors to 10K, it could 
be only at most 10MB memory usage. In real case, it's rare to have cluster size 
of over 10K and the chance that all these executors decomed and lost at the 
same time would be small.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #38441: [SPARK-40979][CORE] Keep removed executor info in decommission state

2022-10-31 Thread GitBox


dongjoon-hyun commented on PR #38441:
URL: https://github.com/apache/spark/pull/38441#issuecomment-1297671091

   As I mentioned before, this cannot be a followup. I created SPARK-40979 for 
you because you want to discuss this more.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #38441: [SPARK-40481][CORE][FOLLOWUP] Keep removed executor info in decommission state

2022-10-31 Thread GitBox


dongjoon-hyun commented on PR #38441:
URL: https://github.com/apache/spark/pull/38441#issuecomment-129747

   Could you explain how it can prevent your problem completely? That sounds 
still greedy and best-effort approach to mitigate your issue.
   
   > What I mean is something like below, the memory added is constant 
regardless of cluster size. @dongjoon-hyun
   > 
   > ```
   > // limit the set size to 100 or 1000
   > val executorsRemovedDueToDecom = new HashSet[String]
   > 
   > executorsPendingDecommission.remove(executorId)
   > executorsRemovedDueToDecom.add(executorId)
   > ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor commented on pull request #38418: [SPARK-40944][SQL] Relax ordering constraint for CREATE TABLE column options

2022-10-31 Thread GitBox


dtenedor commented on PR #38418:
URL: https://github.com/apache/spark/pull/38418#issuecomment-1297666241

   @cloud-fan now a different test flaked:
   
   
![image](https://user-images.githubusercontent.com/99207096/199107322-edb4aea8-40b3-42f6-b302-c577ae2083c6.png)
   
   I am going to re-try it agin...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] warrenzhu25 commented on pull request #38441: [SPARK-40481][CORE][FOLLOWUP] Keep removed executor info in decommission state

2022-10-31 Thread GitBox


warrenzhu25 commented on PR #38441:
URL: https://github.com/apache/spark/pull/38441#issuecomment-1297661121

   > No, I don't think that is a correct way because there exists a large Spark 
cluster. Please don't try to remove that line.
   
   What I mean is something like below, the memory added is constant regardless 
of cluster size.
   
   ```
   // limit the set size to 100 or 1000
   val executorsRemovedDueToDecom = new HashSet[String]
   
   executorsPendingDecommission.remove(executorId)
   executorsRemovedDueToDecom.add(executorId)
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #38441: [SPARK-40481][CORE][FOLLOWUP] Keep removed executor info in decommission state

2022-10-31 Thread GitBox


dongjoon-hyun commented on PR #38441:
URL: https://github.com/apache/spark/pull/38441#issuecomment-1297654097

   No, I don't think that is a correct way because there exists a large Spark 
cluster. Please don't try to remove this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] warrenzhu25 commented on pull request #38441: [SPARK-40481][CORE][FOLLOWUP] Keep removed executor info in decommission state

2022-10-31 Thread GitBox


warrenzhu25 commented on PR #38441:
URL: https://github.com/apache/spark/pull/38441#issuecomment-1297651992

   > If the original contribution 
[SPARK-40481](https://issues.apache.org/jira/browse/SPARK-40481) is 
insufficient for you still, we can revert it first before pursuing this kind of 
big change. We need to revisit from the bottom, @warrenzhu25 .
   
   The original PR works for most cases, but some lost executors due to 
decommission cannot be ignored as we lost such info so I want to improve this. 
I understand the concern of keeping removed executor. 
   
   What do you think of keeping only most recent 100 lost executors due to 
decommission in a separate set? @dongjoon-hyun  @mridulm 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] grundprinzip commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

2022-10-31 Thread GitBox


grundprinzip commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009839486


##
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##
@@ -207,3 +208,23 @@ message Sample {
 int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;
+  // Optional. Default value = 1
+  Step step = 3;
+  // Optional. Default value is assigned by 1) SQL conf 
"spark.sql.leafNodeDefaultParallelism" if
+  // it is set, or 2) spark default parallelism.
+  NumPartitions num_partitions = 4;

Review Comment:
   So in fewer words :) when `num_partitions` is an integer the default value 
is `0` even if it's not and for scalar types we can't differentiate between 
present or not. Understanding if `0` is a valid or invalid value defeats the 
purpose.
   
   Thanks for the additional color!
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009835612


##
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##
@@ -207,3 +208,23 @@ message Sample {
 int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;
+  // Optional. Default value = 1
+  Step step = 3;
+  // Optional. Default value is assigned by 1) SQL conf 
"spark.sql.leafNodeDefaultParallelism" if
+  // it is set, or 2) spark default parallelism.
+  NumPartitions num_partitions = 4;

Review Comment:
   There are two dimensions of things in this area:
   1. Required versus Optional.
   A field is required, meaning it must be set. A field can be optional. 
Meaning it could be set or not.
   
   2. Field has default value or not.
   A field can have a default value if not set. 
   
   
   The second point is an addition for the first point.  If there is a field 
which is not set, there could be a default value to be used.
   
   There are special cases that the default value for proto, is the same as the 
default value that Spark uses. In that case we don't need to differentiate the 
optionality. Otherwise we need this way to differentiate `set versus not set`, 
to adopt default values of Spark (unless we don't care the default values in 
Spark). 
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009835612


##
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##
@@ -207,3 +208,23 @@ message Sample {
 int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;
+  // Optional. Default value = 1
+  Step step = 3;
+  // Optional. Default value is assigned by 1) SQL conf 
"spark.sql.leafNodeDefaultParallelism" if
+  // it is set, or 2) spark default parallelism.
+  NumPartitions num_partitions = 4;

Review Comment:
   There are two dimensions of things in this area:
   1. Required versus Optional.
   A field is required, meaning it must be set. A field can be optional. 
Meaning it could be set or not.
   
   2. Field has default value or not.
   A field can have a default value if not set. 
   
   
   The second point is an addition for the first point.  If there is a field 
which is not set, there could be a default value to be used.
   
   There are special cases that the default value for proto, is the same as the 
default value that Spark uses. In that case we don't need to differentiate the 
optionally. Otherwise we need this way to differentiate `set versus not set`, 
to adopt default values of Spark (unless we don't care the default values in 
Spark). 
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009836536


##
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##
@@ -207,3 +208,23 @@ message Sample {
 int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;
+  // Optional. Default value = 1
+  Step step = 3;
+  // Optional. Default value is assigned by 1) SQL conf 
"spark.sql.leafNodeDefaultParallelism" if
+  // it is set, or 2) spark default parallelism.
+  NumPartitions num_partitions = 4;

Review Comment:
   To really answer your question: if we plan to respect default values for 
Spark for those optionally fields whose default proto values are different from 
Spark default values, this is the only way to respect default values for Spark. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009835612


##
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##
@@ -207,3 +208,23 @@ message Sample {
 int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;
+  // Optional. Default value = 1
+  Step step = 3;
+  // Optional. Default value is assigned by 1) SQL conf 
"spark.sql.leafNodeDefaultParallelism" if
+  // it is set, or 2) spark default parallelism.
+  NumPartitions num_partitions = 4;

Review Comment:
   There are two dimensions of things in this area:
   1. Required versus Optional.
   A field is required, meaning it must be set. A field can be optional. 
Meaning it could be set or not.
   
   2. Field has default value or not.
   A field can have a default value if not set. 
   
   
   2. is an addition for 1.  If there is a field which is not set, there could 
be a default value to be used.
   
   There are special cases that the default value for proto, is the same as the 
default value that Spark uses. In that case we don't need to differentiate the 
optionally. Otherwise we need this way to differentiate `set versus not set`, 
to adopt default values of Spark (unless we don't care the default values in 
Spark). 
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] grundprinzip commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

2022-10-31 Thread GitBox


grundprinzip commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009832423


##
connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala:
##
@@ -175,6 +178,28 @@ package object dsl {
   }
 
   object plans { // scalastyle:ignore
+implicit class DslMockRemoteSession(val session: MockRemoteSession) {
+  def range(
+  start: Option[Int],
+  end: Int,
+  step: Option[Int],
+  numPartitions: Option[Int]): Relation = {
+val range = proto.Range.newBuilder()

Review Comment:
   As long as no catalyst is in this package this is good with me. Thanks for 
clarifying.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

2022-10-31 Thread GitBox


amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009828791


##
connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala:
##
@@ -175,6 +178,28 @@ package object dsl {
   }
 
   object plans { // scalastyle:ignore
+implicit class DslMockRemoteSession(val session: MockRemoteSession) {
+  def range(
+  start: Option[Int],
+  end: Int,
+  step: Option[Int],
+  numPartitions: Option[Int]): Relation = {
+val range = proto.Range.newBuilder()

Review Comment:
   It makes sense for `SparkConnectPlanner` where Catalyst and Proto are both 
mixed together, and we are keeping the approach you are asking there.
   
   However this is the Connect DSL that only deal with protos. No Catalyst 
included in this package: 
https://github.com/apache/spark/blob/9fc3aa0b1c092ab1f13b26582e3ece7440fbfc3b/connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala#L17



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



  1   2   3   >