[GitHub] [spark] beliefer commented on pull request #40697: [SPARK-43061][SQL] Introduce TaskEvaluator for SQL operator execution

2023-04-06 Thread via GitHub


beliefer commented on PR #40697:
URL: https://github.com/apache/spark/pull/40697#issuecomment-149785

   @cloud-fan Thank you for pinging me. How does the performance compare before 
and after this PR?





[GitHub] [spark] viirya commented on a diff in pull request #40688: [SPARK-43021][SQL] `CoalesceBucketsInJoin` not work when using AQE

2023-04-06 Thread via GitHub


viirya commented on code in PR #40688:
URL: https://github.com/apache/spark/pull/40688#discussion_r1160479528


##
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/InsertAdaptiveSparkPlan.scala:
##
@@ -60,6 +61,7 @@ case class InsertAdaptiveSparkPlan(
   val subqueryMap = buildSubqueryMap(plan)
   val planSubqueriesRule = PlanAdaptiveSubqueries(subqueryMap)
   val preprocessingRules = Seq(
+CoalesceBucketsInJoin,
 planSubqueriesRule)

Review Comment:
   Hmm, why don't we put it into `queryStagePreparationRules`? These no-op physical 
rules, including `DisableUnnecessaryBucketedScan`, are put there.
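
   A minimal sketch of the suggested placement, assuming `queryStagePreparationRules` 
is the rule sequence in `AdaptiveSparkPlanExec` that already hosts 
`DisableUnnecessaryBucketedScan` (the surrounding entries are illustrative, not 
taken from this thread):

   ```
   // Hypothetical sketch only: the point is where the rule would live,
   // not the exact contents of the sequence.
   private def queryStagePreparationRules: Seq[Rule[SparkPlan]] = Seq(
     CoalesceBucketsInJoin,           // placement suggested in this comment
     DisableUnnecessaryBucketedScan   // existing rule mentioned above
     // ... other preparation rules elided ...
   )
   ```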






[GitHub] [spark] yaooqinn commented on a diff in pull request #40697: [SPARK-43061][SQL] Introduce TaskEvaluator for SQL operator execution

2023-04-06 Thread via GitHub


yaooqinn commented on code in PR #40697:
URL: https://github.com/apache/spark/pull/40697#discussion_r1160479375


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -1808,6 +1808,13 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val USE_TASK_EVALUATOR = buildConf("spark.sql.execution.useTaskEvaluator")

Review Comment:
   Generate a helper for it?
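
   For context, a hedged sketch of what such a helper typically looks like inside 
`class SQLConf` (the accessor name `useTaskEvaluator` is an assumption derived 
from the config key, not code from this PR):

   ```
   // Sketch of a per-config accessor in class SQLConf (name assumed).
   def useTaskEvaluator: Boolean = getConf(SQLConf.USE_TASK_EVALUATOR)
   ```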






[GitHub] [spark] zsxwing commented on a diff in pull request #40561: [SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark

2023-04-06 Thread via GitHub


zsxwing commented on code in PR #40561:
URL: https://github.com/apache/spark/pull/40561#discussion_r1160479372


##
python/pyspark/sql/dataframe.py:
##
@@ -3928,6 +3928,71 @@ def dropDuplicates(self, subset: Optional[List[str]] = 
None) -> "DataFrame":
 jdf = self._jdf.dropDuplicates(self._jseq(subset))
 return DataFrame(jdf, self.sparkSession)
 
+def dropDuplicatesWithinWatermark(self, subset: Optional[List[str]] = 
None) -> "DataFrame":
+"""Return a new :class:`DataFrame` with duplicate rows removed,
+ optionally only considering certain columns, within watermark.
+
+For a static batch :class:`DataFrame`, it just drops duplicate rows. 
For a streaming
+:class:`DataFrame`, this will keep all data across triggers as 
intermediate state to drop
+duplicated rows. The state will be kept to guarantee the semantic, 
"Events are deduplicated
+as long as the time distance of earliest and latest events are smaller 
than the delay
+threshold of watermark." The watermark for the input 
:class:`DataFrame` must be set via
+:func:`withWatermark`. Users are encouraged to set the delay threshold 
of watermark longer

Review Comment:
   It sounds weird to me that a method is called 
dropDuplicates**WithinWatermark** but I don't need to set the watermark.
   
   > For batch, there are a bunch of tools to perform deduplication: distinct / 
dropDuplicates / dropDuplicatesWithinWatermark. Most batch use cases don't 
need to reach for dropDuplicatesWithinWatermark.
   
   I think the common use case for `dropDuplicatesWithinWatermark` in batch is: 
develop the code in batch mode and switch to streaming later. In this case, 
catching potential issues in batch mode is better.
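
   For reference, a hedged usage sketch in Scala, assuming the API lands as 
proposed in this PR (the source and column names are illustrative):

   ```
   // Streaming dedup sketch: the watermark must be set before calling
   // dropDuplicatesWithinWatermark, which is the point discussed above.
   val spark = org.apache.spark.sql.SparkSession.builder()
     .appName("dedup-within-watermark-sketch")
     .getOrCreate()

   val events = spark.readStream
     .format("rate")                          // illustrative source; real jobs use Kafka, files, etc.
     .load()                                  // provides `timestamp` and `value` columns
     .withColumnRenamed("value", "eventId")

   val deduped = events
     .withWatermark("timestamp", "10 minutes")     // delay threshold for late data
     .dropDuplicatesWithinWatermark("eventId")     // dedup within the watermark delay
   ```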






[GitHub] [spark] ueshin opened a new pull request, #40698: [SPARK-43062][INFRA][PYTHON][TESTS] Add options to lint-python to run each test separately

2023-04-06 Thread via GitHub


ueshin opened a new pull request, #40698:
URL: https://github.com/apache/spark/pull/40698

   ### What changes were proposed in this pull request?
   
   Add options to `lint-python` to run each test separately.
   
   ```
   lint-python [--compile] [--black] [--flake8] [--mypy] [--mypy-examples] 
[--mypy-data]
   ```
   
   ### Why are the changes needed?
   
   Running each test separately is sometimes useful during development.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Manually ran `lint-python` with the options.





[GitHub] [spark] dongjoon-hyun commented on pull request #40697: [SPARK-43061][SQL] Introduce TaskEvaluator for SQL operator execution

2023-04-06 Thread via GitHub


dongjoon-hyun commented on PR #40697:
URL: https://github.com/apache/spark/pull/40697#issuecomment-1499984943

   Got it. +1 for the testing plan.





[GitHub] [spark] dongjoon-hyun commented on pull request #40663: [SPARK-39696][CORE] Fix data race in access to TaskMetrics.externalAccums

2023-04-06 Thread via GitHub


dongjoon-hyun commented on PR #40663:
URL: https://github.com/apache/spark/pull/40663#issuecomment-1499983771

   Please let me know if this is still valid in `branch-3.2`.
   > branch-3.3 and branch-3.2 may also require this one, they also use Scala 
2.13.8





[GitHub] [spark] dongjoon-hyun commented on pull request #40663: [SPARK-39696][CORE] Fix data race in access to TaskMetrics.externalAccums

2023-04-06 Thread via GitHub


dongjoon-hyun commented on PR #40663:
URL: https://github.com/apache/spark/pull/40663#issuecomment-1499982811

   To @LuciferYang  and all.
   
   After double-checking, I found that Apache Spark 3.2.x is not affected 
because it uses Scala 2.13.5.
   
   
https://github.com/apache/spark/blob/7773740e4141444bf78ba75dcee9f3fade7f6e11/pom.xml#L3389
   
   SPARK-35496 (Scala 2.13.7) landed in Apache Spark 3.3.0+. We don't need to 
backport this to branch-3.2.
   





[GitHub] [spark] HeartSaVioR commented on pull request #40696: [SPARK-43056][SS] RocksDB state store commit should continue b/ground work only if its paused

2023-04-06 Thread via GitHub


HeartSaVioR commented on PR #40696:
URL: https://github.com/apache/spark/pull/40696#issuecomment-1499982120

   CI is failing with odd errors.
   @anishshri-db Could you please rebase so that CI is retriggered? If the new 
run fails again, it might be good to post to dev@ and see whether someone has 
encountered this before, and/or whether someone is willing to volunteer a fix 
if it's not a transient issue.





[GitHub] [spark] amaliujia commented on pull request #40693: [SPARK-43058] Move Numeric and Fractional to PhysicalDataType

2023-04-06 Thread via GitHub


amaliujia commented on PR #40693:
URL: https://github.com/apache/spark/pull/40693#issuecomment-1499981247

   @hvanhovell @cloud-fan 





[GitHub] [spark] cloud-fan commented on pull request #40697: [SPARK-43061][SQL] Introduce TaskEvaluator for SQL operator execution

2023-04-06 Thread via GitHub


cloud-fan commented on PR #40697:
URL: https://github.com/apache/spark/pull/40697#issuecomment-1499977213

   @dongjoon-hyun This is a bit of a low-level change, and ideally we should run 
all end-to-end tests twice, with the config on and off. However, it doesn't seem 
worthwhile to double the test resources for this change. How about we turn it on 
by default after we have enough SQL operator coverage?





[GitHub] [spark] dongjoon-hyun commented on pull request #40688: [SPARK-43021][SQL] `CoalesceBucketsInJoin` not work when using AQE

2023-04-06 Thread via GitHub


dongjoon-hyun commented on PR #40688:
URL: https://github.com/apache/spark/pull/40688#issuecomment-1499974575

   cc @imback82, @cloud-fan , @viirya , @sunchao 





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40697: [SPARK-43061][SQL] Introduce TaskEvaluator for SQL operator execution

2023-04-06 Thread via GitHub


dongjoon-hyun commented on code in PR #40697:
URL: https://github.com/apache/spark/pull/40697#discussion_r1160458377


##
core/src/main/scala/org/apache/spark/rdd/MapPartitionsWithEvaluatorRDD.scala:
##
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.rdd
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.{Partition, TaskContext, TaskEvaluatorFactory}
+
+class MapPartitionsWithEvaluatorRDD[T : ClassTag, U : ClassTag](

Review Comment:
   ~`private[spark]`?~






[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40697: [SPARK-43061][SQL] Introduce TaskEvaluator for SQL operator execution

2023-04-06 Thread via GitHub


dongjoon-hyun commented on code in PR #40697:
URL: https://github.com/apache/spark/pull/40697#discussion_r1160458377


##
core/src/main/scala/org/apache/spark/rdd/MapPartitionsWithEvaluatorRDD.scala:
##
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.rdd
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.{Partition, TaskContext, TaskEvaluatorFactory}
+
+class MapPartitionsWithEvaluatorRDD[T : ClassTag, U : ClassTag](

Review Comment:
   `private[spark]`?






[GitHub] [spark] cloud-fan commented on pull request #40437: [SPARK-41259][SQL] SparkSQLDriver Output schema and result string should be consistent

2023-04-06 Thread via GitHub


cloud-fan commented on PR #40437:
URL: https://github.com/apache/spark/pull/40437#issuecomment-1499967555

   > Spark generates golden files for SQLQueryTestSuite;
   
   I think the golden files there should match the `df.show` output instead of 
the Hive result.





[GitHub] [spark] cloud-fan commented on a diff in pull request #40523: [SPARK-42897][SQL] Avoid evaluate variables multiple times for SMJ and SHJ fullOuter join

2023-04-06 Thread via GitHub


cloud-fan commented on code in PR #40523:
URL: https://github.com/apache/spark/pull/40523#discussion_r1160456984


##
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##
@@ -364,10 +364,18 @@ case class ShuffledHashJoinExec(
  |${streamedKeyExprCode.code}
""".stripMargin
 val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"
-
+// The streamedVars may be evaluated again in the following 
consumeFullOuterJoinRow method,
+// so generate the condition checking code with their copies.

Review Comment:
   ```suggestion
   // so generate the condition checking code with their copies, to avoid 
side effects (ExprCode is mutable).
   ```
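
   For reference, a minimal sketch of the copying this comment refers to, assuming 
`streamedVars` is the `Seq[ExprCode]` produced earlier in this method:

   ```
   import org.apache.spark.sql.catalyst.expressions.codegen.ExprCode

   // ExprCode is a case class with mutable fields, so taking copies up front
   // keeps the later consumeFullOuterJoinRow call from mutating the variables
   // referenced by the generated condition-checking code.
   val streamedVarsCopy: Seq[ExprCode] = streamedVars.map(_.copy())
   ```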






[GitHub] [spark] dongjoon-hyun commented on pull request #40697: [SPARK-43061][SQL] Introduce TaskEvaluator for SQL operator execution

2023-04-06 Thread via GitHub


dongjoon-hyun commented on PR #40697:
URL: https://github.com/apache/spark/pull/40697#issuecomment-1499960023

   Also, cc @cloud-fan 





[GitHub] [spark] dongjoon-hyun commented on pull request #40697: [SPARK-43061][SQL] Introduce TaskEvaluator for SQL operator execution

2023-04-06 Thread via GitHub


dongjoon-hyun commented on PR #40697:
URL: https://github.com/apache/spark/pull/40697#issuecomment-1499958805

   Thank you for pinging me, @cloud-fan .





[GitHub] [spark] cloud-fan commented on pull request #40697: [SPARK-43061][SQL] Introduce TaskEvaluator for SQL operator execution

2023-04-06 Thread via GitHub


cloud-fan commented on PR #40697:
URL: https://github.com/apache/spark/pull/40697#issuecomment-1499958423

   cc @viirya @dongjoon-hyun @yaooqinn @beliefer 





[GitHub] [spark] cloud-fan opened a new pull request, #40697: [SPARK-43061][SQL] Introduce TaskEvaluator for SQL operator execution

2023-04-06 Thread via GitHub


cloud-fan opened a new pull request, #40697:
URL: https://github.com/apache/spark/pull/40697

   
   
   ### What changes were proposed in this pull request?
   
   This PR adds a new API `TaskEvaluator` to define the computing logic and 
requires the caller side to explicitly list what needs to be serialized and 
sent to executors via `TaskEvaluatorFactory`.
   
   Two new RDD APIs are added to use `TaskEvaluator`:
   ```
 /**
  * Return a new RDD by applying an evaluator to each partition of this 
RDD. The given evaluator
  * factory will be serialized and sent to executors, and each task will 
create an evaluator with
  * the factory, and use the evaluator to transform the data of the input 
partition.
  */
 @DeveloperApi
 @Since("3.5.0")
 def mapPartitionsWithEvaluator[U: ClassTag](
 taskEvaluatorFactory: TaskEvaluatorFactory[T, U]): RDD[U] = withScope {
   new MapPartitionsWithEvaluatorRDD(this, taskEvaluatorFactory)
 }
   
 /**
  * Zip this RDD's partitions with another RDD and return a new RDD by 
applying an evaluator to
  * the zipped partitions. Assumes that the two RDDs have the *same number 
of partitions*, but
  * does *not* require them to have the same number of elements in each 
partition.
  */
 @DeveloperApi
 @Since("3.5.0")
 def zipPartitionsWithEvaluator[U: ClassTag](
 rdd2: RDD[T],
 taskEvaluatorFactory: TaskEvaluatorFactory[T, U]): RDD[U] = withScope {
   new ZippedPartitionsWithEvaluatorRDD(this, rdd2, taskEvaluatorFactory)
 }
   ```
   
   Three SQL operators are updated to use the new API to do execution, as a 
showcase: Project, Filter, WholeStageCodegen. We can migrate more operators 
later. A config is added to still use the old code path by default.
   
   ### Why are the changes needed?
   
   Using a lambda to define the computing logic is a bit tricky:
   1. It's easy to mistakenly reference objects in the closure, which increases 
the time needed to serialize the lambda and send it to executors. `ProjectExec` and 
`FilterExec` use `child.output` in the lambda, which means the entire `child` 
will be serialized (a small illustration follows after this list). There are other 
places that try to avoid this problem, e.g. 
https://github.com/apache/spark/blob/v3.3.2/sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala#L90-L92
   2. Serializing lambdas is strongly discouraged by the [official Java 
guide](https://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexpressions.html#serialization).
 We should eventually get rid of lambdas during distributed execution to make 
Spark more robust.
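
   A minimal, self-contained illustration of point 1 above (not code from this PR; 
`MyOperator` and its field are hypothetical):

   ```
   import org.apache.spark.rdd.RDD

   class MyOperator(val output: Seq[String]) extends Serializable {
     // Bad: referencing `output` inside the lambda captures `this`, so the
     // whole operator ends up in the serialized task closure.
     def badPlan(rdd: RDD[Int]): RDD[Int] =
       rdd.map(i => i + output.length)

     // Better: copy the field into a local val first, so only the Seq is
     // captured. An evaluator factory makes this explicit listing mandatory.
     def betterPlan(rdd: RDD[Int]): RDD[Int] = {
       val localOutput = output
       rdd.map(i => i + localOutput.length)
     }
   }
   ```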
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   existing tests





[GitHub] [spark] dongjoon-hyun commented on pull request #40639: [SPARK-43007][BUILD] Upgrade rocksdbjni to 8.0.0

2023-04-06 Thread via GitHub


dongjoon-hyun commented on PR #40639:
URL: https://github.com/apache/spark/pull/40639#issuecomment-1499946937

   Thank you, @HeartSaVioR . To @LuciferYang , could you address the above 
comment?





[GitHub] [spark] HeartSaVioR commented on pull request #40639: [SPARK-43007][BUILD] Upgrade rocksdbjni to 8.0.0

2023-04-06 Thread via GitHub


HeartSaVioR commented on PR #40639:
URL: https://github.com/apache/spark/pull/40639#issuecomment-1499945844

   Sorry for the post-review comment. It'd be nice if you don't mind running the 
benchmark below as well.
   
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/StateStoreBasicOperationsBenchmark.scala
   The test is mostly about the performance of WriteBatch rather than RocksDB 
itself, but that's what we use in the RocksDB state store provider.





[GitHub] [spark] dongjoon-hyun commented on pull request #40639: [SPARK-43007][BUILD] Upgrade rocksdbjni to 8.0.0

2023-04-06 Thread via GitHub


dongjoon-hyun commented on PR #40639:
URL: https://github.com/apache/spark/pull/40639#issuecomment-1499944650

   Merged to master. Thank you for waiting.





[GitHub] [spark] dongjoon-hyun closed pull request #40639: [SPARK-43007][BUILD] Upgrade rocksdbjni to 8.0.0

2023-04-06 Thread via GitHub


dongjoon-hyun closed pull request #40639: [SPARK-43007][BUILD] Upgrade 
rocksdbjni to 8.0.0
URL: https://github.com/apache/spark/pull/40639





[GitHub] [spark] wankunde commented on a diff in pull request #40523: [SPARK-42897][SQL] Avoid evaluate variables multiple times for SMJ and SHJ fullOuter join

2023-04-06 Thread via GitHub


wankunde commented on code in PR #40523:
URL: https://github.com/apache/spark/pull/40523#discussion_r1160439594


##
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/JoinCodegenSupport.scala:
##
@@ -95,4 +95,27 @@ trait JoinCodegenSupport extends CodegenSupport with 
BaseJoinExec {
   }
 }
   }
+
+  /**
+   * Splits variables based on whether it's used by condition or not, returns 
the code to create

Review Comment:
   Yes, the `splitVarsByCondition` method is moved into `JoinCodegenSupport` so 
it can be used in `ShuffledHashJoinExec`.






[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-04-06 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1160427447


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AssignmentUtils.scala:
##
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.SQLConfHelper
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, 
AttributeReference, CreateNamedStruct, Expression, ExtractValue, 
GetStructField, Literal, NamedExpression}
+import org.apache.spark.sql.catalyst.plans.logical.Assignment
+import org.apache.spark.sql.catalyst.util.CharVarcharUtils
+import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf.StoreAssignmentPolicy
+import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType}
+
+object AssignmentUtils extends SQLConfHelper with CastSupport {
+
+  private case class ColumnUpdate(ref: Seq[String], expr: Expression)
+
+  /**
+   * Aligns assignments to match table columns.
+   * 
+   * This method processes and reorders given assignments so that each target 
column gets
+   * an expression it should be set to. If a column does not have a matching 
assignment,
+   * it will be set to its current value. For example, if one passes table 
attributes c1, c2
+   * and an assignment c2 = 1, this method will return c1 = c1, c2 = 1.
+   * 
+   * This method also handles updates to nested columns. If there is an 
assignment to a particular
+   * nested field, this method will construct a new struct with one field 
updated preserving other
+   * fields that have not been modified. For example, if one passes table 
attributes c1, c2
+   * where c2 is a struct with fields n1 and n2 and an assignment c2.n2 = 1, 
this method will
+   * return c1 = c1, c2 = struct(c2.n1, 1).
+   *
+   * @param attrs table attributes
+   * @param assignments assignments to align
+   * @return aligned assignments that match table columns
+   */
+  def alignAssignments(
+  attrs: Seq[Attribute],
+  assignments: Seq[Assignment]): Seq[Assignment] = {
+
+val errors = new mutable.ArrayBuffer[String]()
+
+val output = applyUpdates(
+  updates = assignments.map(toColumnUpdate),
+  cols = attrs.map(restoreActualType),
+  colExprs = attrs,
+  addError = err => errors += err)
+
+if (errors.nonEmpty) {
+  throw 
QueryCompilationErrors.invalidRowLevelOperationAssignments(assignments, 
errors.toSeq)
+}
+
+attrs.zip(output).map { case (attr, expr) => Assignment(attr, expr) }
+  }
+
+  private def toColumnUpdate(assignment: Assignment): ColumnUpdate = {
+ColumnUpdate(toRef(assignment.key), assignment.value)
+  }
+
+  private def restoreActualType(attr: Attribute): Attribute = {
+
attr.withDataType(CharVarcharUtils.getRawType(attr.metadata).getOrElse(attr.dataType))
+  }
+
+  private def applyUpdates(
+  updates: Seq[ColumnUpdate],
+  cols: Seq[Attribute],
+  colExprs: Seq[Expression],
+  addError: String => Unit,
+  colPath: Seq[String] = Nil): Seq[Expression] = {
+
+// iterate through columns at the current level and find matching updates
+cols.zip(colExprs).map { case (col, colExpr) =>
+  // find matches for this column or any of its children
+  val prefixMatchedUpdates = updates.filter(update => 
conf.resolver(update.ref.head, col.name))
+  prefixMatchedUpdates match {
+// if there is no exact match and no match for children, return the 
column expr as is
+case matchedUpdates if matchedUpdates.isEmpty =>
+  colExpr
+
+// if there is only one update and it is an exact match, return the 
assigned expression
+case Seq(matchedUpdate) if isExactMatch(matchedUpdate, col) =>
+  applyUpdate(matchedUpdate.expr, col, addError, colPath :+ col.name)
+
+// if there are matches only for children
+case matchedUpdates if !hasExactMatch(matchedUpdates, col) =>
+  val newColPath = colPath 

[GitHub] [spark] ueshin commented on a diff in pull request #40692: [SPARK-43055][CONNECT][PYTHON] Support duplicated nested field names

2023-04-06 Thread via GitHub


ueshin commented on code in PR #40692:
URL: https://github.com/apache/spark/pull/40692#discussion_r1160421573


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/SparkResult.scala:
##
@@ -60,13 +61,19 @@ private[sql] class SparkResult[T](
   private def processResponses(stopOnFirstNonEmptyResponse: Boolean): Boolean 
= {
 while (responses.hasNext) {
   val response = responses.next()
+  if (response.hasSchema) {
+structType =

Review Comment:
   It is the original schema; the one in the Arrow batch is modified to 
deduplicate the struct field names.
   Also, the original schema contains UDTs if they are supported. The Python 
client works fine with that.






[GitHub] [spark] wangyum commented on a diff in pull request #40685: [SPARK-43050][SQL] Fix construct aggregate expressions by replacing grouping functions

2023-04-06 Thread via GitHub


wangyum commented on code in PR #40685:
URL: https://github.com/apache/spark/pull/40685#discussion_r1160419934


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -610,31 +610,21 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 aggregations: Seq[NamedExpression],
 groupByAliases: Seq[Alias],
 groupingAttrs: Seq[Expression],
-gid: Attribute): Seq[NamedExpression] = aggregations.map { agg =>
-  // collect all the found AggregateExpression, so we can check an 
expression is part of
-  // any AggregateExpression or not.
-  val aggsBuffer = ArrayBuffer[Expression]()
-  // Returns whether the expression belongs to any expressions in 
`aggsBuffer` or not.
-  def isPartOfAggregation(e: Expression): Boolean = {
-aggsBuffer.exists(a => a.exists(_ eq e))
-  }
-  replaceGroupingFunc(agg, groupByExprs, gid).transformDown {

Review Comment:
   `transformDown` will also transform aggregate expressions. For expression 
`b`, the first occurrence should not be replaced while the second needs to be 
replaced; `transformDown` cannot avoid transforming aggregate expressions:
   
![image](https://user-images.githubusercontent.com/5399861/230537708-ac818821-7e24-485d-b50a-e107a32657e8.png)
   
   






[GitHub] [spark] wangyum commented on a diff in pull request #40685: [SPARK-43050][SQL] Fix construct aggregate expressions by replacing grouping functions

2023-04-06 Thread via GitHub


wangyum commented on code in PR #40685:
URL: https://github.com/apache/spark/pull/40685#discussion_r1160419934


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -610,31 +610,21 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 aggregations: Seq[NamedExpression],
 groupByAliases: Seq[Alias],
 groupingAttrs: Seq[Expression],
-gid: Attribute): Seq[NamedExpression] = aggregations.map { agg =>
-  // collect all the found AggregateExpression, so we can check an 
expression is part of
-  // any AggregateExpression or not.
-  val aggsBuffer = ArrayBuffer[Expression]()
-  // Returns whether the expression belongs to any expressions in 
`aggsBuffer` or not.
-  def isPartOfAggregation(e: Expression): Boolean = {
-aggsBuffer.exists(a => a.exists(_ eq e))
-  }
-  replaceGroupingFunc(agg, groupByExprs, gid).transformDown {

Review Comment:
   `transformDown` will also transform aggregate expressions. For expression 
`b`, the first occurrence should not be replaced while the second needs to be 
replaced; `transformDown` can't do that:
   
![image](https://user-images.githubusercontent.com/5399861/230537708-ac818821-7e24-485d-b50a-e107a32657e8.png)
   
   






[GitHub] [spark] anishshri-db commented on pull request #40696: [SPARK-43056][SS] RocksDB state store commit should continue b/ground work only if its paused

2023-04-06 Thread via GitHub


anishshri-db commented on PR #40696:
URL: https://github.com/apache/spark/pull/40696#issuecomment-1499909905

   @HeartSaVioR - PTAL. Thanks





[GitHub] [spark] anishshri-db opened a new pull request, #40696: [SPARK-43056] RocksDB state store commit should continue b/ground work only if its paused

2023-04-06 Thread via GitHub


anishshri-db opened a new pull request, #40696:
URL: https://github.com/apache/spark/pull/40696

   ### What changes were proposed in this pull request?
   RocksDB state store commit should continue background work in the finally 
block only if it is paused.
   
   ### Why are the changes needed?
   If an exception is thrown earlier in the commit sequence before background 
work is paused, we fail with an exception in the finally clause.
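
   A generic sketch of the fix pattern described above (an illustration only, not 
the actual `RocksDB.scala` code):

   ```
   // Track whether background work was actually paused, so the finally block
   // only calls continueBackgroundWork when pauseBackgroundWork succeeded.
   def commitSketch(db: org.rocksdb.RocksDB): Unit = {
     var backgroundWorkPaused = false
     try {
       // ... commit steps that may throw before pausing ...
       db.pauseBackgroundWork()
       backgroundWorkPaused = true
       // ... checkpoint / upload steps ...
     } finally {
       if (backgroundWorkPaused) {
         db.continueBackgroundWork()
       }
     }
   }
   ```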
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Ran unit tests
   ```
   StateStoreIntegrationSuite
   [info] Run completed in 16 seconds, 131 milliseconds.
   [info] Total number of tests run: 5
   [info] Suites: completed 1, aborted 0
   [info] Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
   [info] All tests passed.
   ```
   
   ```
   StateStoreSuite
   [info] Run completed in 1 minute, 18 seconds.
   [info] Total number of tests run: 73
   [info] Suites: completed 1, aborted 0
   [info] Tests: succeeded 73, failed 0, canceled 0, ignored 0, pending 0
   [info] All tests passed.
   ```
   
   With a simulated exception and without the fix, the code crashes with the 
stack trace below. With the fix, we don't see the issue in the finally block.
   ```
 org.rocksdb.RocksDBException: InvalidArgument
 at org.rocksdb.RocksDB.continueBackgroundWork(Native Method)
 at org.rocksdb.RocksDB.continueBackgroundWork(RocksDB.java:3611)
 at 
org.apache.spark.sql.execution.streaming.state.RocksDB.commit(RocksDB.scala:421)
   ```
   
   





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40683: [SPARK-43049][SQL] Use CLOB instead of VARCHAR(255) for StringType for Oracle JDBC

2023-04-06 Thread via GitHub


dongjoon-hyun commented on code in PR #40683:
URL: https://github.com/apache/spark/pull/40683#discussion_r1160419413


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala:
##
@@ -106,17 +106,26 @@ class OracleIntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTes
 var t = spark.table(tbl)
 var expectedSchema = new StructType().add("ID", DecimalType(10, 0), true, 
defaultMetadata)
 assert(t.schema === expectedSchema)
-sql(s"ALTER TABLE $tbl ALTER COLUMN id TYPE STRING")
+sql(s"ALTER TABLE $tbl ALTER COLUMN id TYPE LONG")
 t = spark.table(tbl)
-expectedSchema = new StructType().add("ID", StringType, true, 
defaultMetadata)
+expectedSchema = new StructType().add("ID", DecimalType(19, 0), true, 
defaultMetadata)

Review Comment:
   Got it. Thank you, @yaooqinn .






[GitHub] [spark] dongjoon-hyun commented on pull request #40663: [SPARK-39696][CORE] Fix data race in access to TaskMetrics.externalAccums

2023-04-06 Thread via GitHub


dongjoon-hyun commented on PR #40663:
URL: https://github.com/apache/spark/pull/40663#issuecomment-1499907848

   Thank you, @eejbyfeldt , @mridulm , @HyukjinKwon , and @LuciferYang .
   I also backported to branch-3.3.
   
   However, we need a new PR for branch-3.2 due to the compilation error, 
@LuciferYang and @eejbyfeldt .
   ```
   [error] 
/Users/dongjoon/APACHE/spark-merge/core/src/test/scala/org/apache/spark/executor/ExecutorSuite.scala:33:19:
 object logging is not a member of package org.apache
   [error] import org.apache.logging.log4j._
   ```





[GitHub] [spark] yaooqinn commented on a diff in pull request #40683: [SPARK-43049][SQL] Use CLOB instead of VARCHAR(255) for StringType for Oracle JDBC

2023-04-06 Thread via GitHub


yaooqinn commented on code in PR #40683:
URL: https://github.com/apache/spark/pull/40683#discussion_r1160416286


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala:
##
@@ -106,17 +106,26 @@ class OracleIntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTes
 var t = spark.table(tbl)
 var expectedSchema = new StructType().add("ID", DecimalType(10, 0), true, 
defaultMetadata)
 assert(t.schema === expectedSchema)
-sql(s"ALTER TABLE $tbl ALTER COLUMN id TYPE STRING")
+sql(s"ALTER TABLE $tbl ALTER COLUMN id TYPE LONG")
 t = spark.table(tbl)
-expectedSchema = new StructType().add("ID", StringType, true, 
defaultMetadata)
+expectedSchema = new StructType().add("ID", DecimalType(19, 0), true, 
defaultMetadata)

Review Comment:
   Yes, Oracle's built-in integral types are `NUMBER(p, 0)`; the `LONG` type 
there is a deprecated alternative for character strings.






[GitHub] [spark] dongjoon-hyun commented on pull request #40555: [SPARK-42926][BUILD][SQL] Upgrade Parquet to 1.13.0

2023-04-06 Thread via GitHub


dongjoon-hyun commented on PR #40555:
URL: https://github.com/apache/spark/pull/40555#issuecomment-1499901243

   Thank you for updating.





[GitHub] [spark] cloud-fan closed pull request #40662: [SPARK-43030][SQL] Deduplicate relations with metadata columns

2023-04-06 Thread via GitHub


cloud-fan closed pull request #40662: [SPARK-43030][SQL] Deduplicate relations 
with metadata columns
URL: https://github.com/apache/spark/pull/40662





[GitHub] [spark] cloud-fan commented on pull request #40662: [SPARK-43030][SQL] Deduplicate relations with metadata columns

2023-04-06 Thread via GitHub


cloud-fan commented on PR #40662:
URL: https://github.com/apache/spark/pull/40662#issuecomment-1499900796

   Thanks for the review, merging to master (it doesn't fix any actual bug, so 
there is no need to backport)!





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40683: [SPARK-43049][SQL] Use CLOB instead of VARCHAR(255) for StringType for Oracle JDBC

2023-04-06 Thread via GitHub


dongjoon-hyun commented on code in PR #40683:
URL: https://github.com/apache/spark/pull/40683#discussion_r1160414036


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala:
##
@@ -106,17 +106,26 @@ class OracleIntegrationSuite extends 
DockerJDBCIntegrationV2Suite with V2JDBCTes
 var t = spark.table(tbl)
 var expectedSchema = new StructType().add("ID", DecimalType(10, 0), true, 
defaultMetadata)
 assert(t.schema === expectedSchema)
-sql(s"ALTER TABLE $tbl ALTER COLUMN id TYPE STRING")
+sql(s"ALTER TABLE $tbl ALTER COLUMN id TYPE LONG")
 t = spark.table(tbl)
-expectedSchema = new StructType().add("ID", StringType, true, 
defaultMetadata)
+expectedSchema = new StructType().add("ID", DecimalType(19, 0), true, 
defaultMetadata)

Review Comment:
   `LONG` is `DecimalType(19, 0)`?






[GitHub] [spark] beliefer commented on pull request #39170: [SPARK-41674][SQL] Runtime filter should supports multi level shuffle join side as filter creation side

2023-04-06 Thread via GitHub


beliefer commented on PR #39170:
URL: https://github.com/apache/spark/pull/39170#issuecomment-1499898495

   Ping @cloud-fan, could you find time to take a look?





[GitHub] [spark] cloud-fan commented on a diff in pull request #40685: [SPARK-43050][SQL] Fix construct aggregate expressions by replacing grouping functions

2023-04-06 Thread via GitHub


cloud-fan commented on code in PR #40685:
URL: https://github.com/apache/spark/pull/40685#discussion_r1160411387


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -610,31 +610,21 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 aggregations: Seq[NamedExpression],
 groupByAliases: Seq[Alias],
 groupingAttrs: Seq[Expression],
-gid: Attribute): Seq[NamedExpression] = aggregations.map { agg =>
-  // collect all the found AggregateExpression, so we can check an 
expression is part of
-  // any AggregateExpression or not.
-  val aggsBuffer = ArrayBuffer[Expression]()
-  // Returns whether the expression belongs to any expressions in 
`aggsBuffer` or not.
-  def isPartOfAggregation(e: Expression): Boolean = {
-aggsBuffer.exists(a => a.exists(_ eq e))
-  }
-  replaceGroupingFunc(agg, groupByExprs, gid).transformDown {

Review Comment:
   why is `transformDown` not working?






[GitHub] [spark] beliefer commented on pull request #40563: [SPARK-41233][FOLLOWUP] Refactor `array_prepend` with `RuntimeReplaceable`

2023-04-06 Thread via GitHub


beliefer commented on PR #40563:
URL: https://github.com/apache/spark/pull/40563#issuecomment-1499895556

   > After a second thought, this makes the performance worse. We should 
improve `ArrayInsert` to generate more efficient code if the position argument 
is constant. For example, if the position argument is constant 1, then we don't 
need to generate code to check if position is negative or not.
   
   You mean creating another PR to simplify the code of `ArrayInsert`?





[GitHub] [spark] beliefer commented on pull request #40563: [SPARK-41233][FOLLOWUP] Refactor `array_prepend` with `RuntimeReplaceable`

2023-04-06 Thread via GitHub


beliefer commented on PR #40563:
URL: https://github.com/apache/spark/pull/40563#issuecomment-1499895163

   > Can we update the PR description? `array_append` is untouched now.
   
   Updated.





[GitHub] [spark] beliefer commented on a diff in pull request #40678: [SPARK-43040][SQL] Improve TimestampNTZ type support in JDBC data source

2023-04-06 Thread via GitHub


beliefer commented on code in PR #40678:
URL: https://github.com/apache/spark/pull/40678#discussion_r1160410345


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala:
##
@@ -138,6 +139,12 @@ class PostgresIntegrationSuite extends 
DockerJDBCIntegrationSuite {
   "c0 money)").executeUpdate()
 conn.prepareStatement("INSERT INTO money_types VALUES " +
   "('$1,000.00')").executeUpdate()
+
+conn.prepareStatement(s"CREATE TABLE timestamp_ntz(v 
timestamp)").executeUpdate()

Review Comment:
   > It is actually the same case with H2 as well. `timestamp == timestamp 
without timezone` http://www.h2database.com/html/datatypes.html#timestamp_type
   
   Yeah, I forgot that. Because the H2 dialect uses the mapping `Spark timestampNTZ 
<-> H2 timestamp without timezone`, I think the Postgres dialect should follow 
the same behavior.






[GitHub] [spark] Hisoka-X commented on a diff in pull request #40684: [SPARK-41532][CONNECT][CLIENT] Add check for operations that involve multiple data frames

2023-04-06 Thread via GitHub


Hisoka-X commented on code in PR #40684:
URL: https://github.com/apache/spark/pull/40684#discussion_r1160409014


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala:
##
@@ -69,6 +69,9 @@ class SparkSession private[sql] (
 
   private[this] val allocator = new RootAllocator()
 
+  // a unique session ID for this session from client.
+  private[sql] def sessionId: String = client.sessionId

Review Comment:
   Good advice. Should I change it in this PR? @HyukjinKwon @hvanhovell 






[GitHub] [spark] tianhanhu-db commented on a diff in pull request #40678: [SPARK-43040][SQL] Improve TimestampNTZ type support in JDBC data source

2023-04-06 Thread via GitHub


tianhanhu-db commented on code in PR #40678:
URL: https://github.com/apache/spark/pull/40678#discussion_r1160408842


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala:
##
@@ -138,6 +139,12 @@ class PostgresIntegrationSuite extends 
DockerJDBCIntegrationSuite {
   "c0 money)").executeUpdate()
 conn.prepareStatement("INSERT INTO money_types VALUES " +
   "('$1,000.00')").executeUpdate()
+
+conn.prepareStatement(s"CREATE TABLE timestamp_ntz(v 
timestamp)").executeUpdate()

Review Comment:
   It is actually the same case with H2 as well. `timestamp == timestamp 
without timezone` http://www.h2database.com/html/datatypes.html#timestamp_type



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #40690: [SPARK-43043][CORE] Improve the performance of MapOutputTracker.updateMapOutput

2023-04-06 Thread via GitHub


mridulm commented on code in PR #40690:
URL: https://github.com/apache/spark/pull/40690#discussion_r1160407817


##
core/src/main/scala/org/apache/spark/MapOutputTracker.scala:
##
@@ -157,22 +164,29 @@ private class ShuffleStatus(
   invalidateSerializedMapOutputStatusCache()
 }
 mapStatuses(mapIndex) = status
+mapIdToMapIndex(status.mapId) = mapIndex
   }
 
   /**
* Update the map output location (e.g. during migration).
*/
   def updateMapOutput(mapId: Long, bmAddress: BlockManagerId): Unit = 
withWriteLock {
 try {
-  val mapStatusOpt = mapStatuses.find(x => x != null && x.mapId == mapId)
+  // OpenHashMap would return 0 if the key doesn't exist.
+  val mapIndex = if (mapIdToMapIndex.contains(mapId)) {
+Some(mapIdToMapIndex(mapId))
+  } else {
+None
+  }

Review Comment:
   Thoughts on changing this to either:
   * Add a `get(key): Option[V]` to `OpenHashMap` or
   * Simply do `Option(mapIdToMapIndex(mapId))` ?
   
   The second option relies on implementation detail that we return `null` when 
`apply` does not find a `key` - while the first option is more principled 
change.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] tianhanhu-db commented on a diff in pull request #40678: [SPARK-43040][SQL] Improve TimestampNTZ type support in JDBC data source

2023-04-06 Thread via GitHub


tianhanhu-db commented on code in PR #40678:
URL: https://github.com/apache/spark/pull/40678#discussion_r1160408278


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala:
##
@@ -138,6 +139,12 @@ class PostgresIntegrationSuite extends 
DockerJDBCIntegrationSuite {
   "c0 money)").executeUpdate()
 conn.prepareStatement("INSERT INTO money_types VALUES " +
   "('$1,000.00')").executeUpdate()
+
+conn.prepareStatement(s"CREATE TABLE timestamp_ntz(v 
timestamp)").executeUpdate()

Review Comment:
   Yes, in Postgres `timestamp` is equivalent to timestamp without time zone.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] tianhanhu-db commented on a diff in pull request #40678: [SPARK-43040][SQL] Improve TimestampNTZ type support in JDBC data source

2023-04-06 Thread via GitHub


tianhanhu-db commented on code in PR #40678:
URL: https://github.com/apache/spark/pull/40678#discussion_r1160408228


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala:
##
@@ -98,6 +100,14 @@ private object PostgresDialect extends JdbcDialect with 
SQLConfHelper {
 case _ => None
   }
 
+  override def convertJavaTimestampToTimestampNTZ(t: Timestamp): Long = {
+DateTimeUtils.localDateTimeToMicros(t.toLocalDateTime)

Review Comment:
   In Postgres, `timestamp` is equivalent to timestamp without time zone.
   It has `timestamptz` to represent timestamp with time zone.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40678: [SPARK-43040][SQL] Improve TimestampNTZ type support in JDBC data source

2023-04-06 Thread via GitHub


cloud-fan commented on code in PR #40678:
URL: https://github.com/apache/spark/pull/40678#discussion_r1160408076


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala:
##
@@ -98,6 +100,14 @@ private object PostgresDialect extends JdbcDialect with 
SQLConfHelper {
 case _ => None
   }
 
+  override def convertJavaTimestampToTimestampNTZ(t: Timestamp): Long = {
+DateTimeUtils.localDateTimeToMicros(t.toLocalDateTime)

Review Comment:
   almost every database has timestamp ntz.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40678: [SPARK-43040][SQL] Improve TimestampNTZ type support in JDBC data source

2023-04-06 Thread via GitHub


cloud-fan commented on code in PR #40678:
URL: https://github.com/apache/spark/pull/40678#discussion_r1160407865


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala:
##
@@ -104,6 +105,23 @@ abstract class JdbcDialect extends Serializable with 
Logging {
*/
   def getJDBCType(dt: DataType): Option[JdbcType] = None
 
+  /**
+   * Convert java.sql.Timestamp to a Long value (internal representation of a 
TimestampNTZType)
+   * holding the microseconds since the epoch of 1970-01-01 00:00:00Z for this 
timestamp.
+   */
+  def convertJavaTimestampToTimestampNTZ(t: Timestamp): Long = {
+DateTimeUtils.fromJavaTimestampNoRebase(t)
+  }
+
+  /**
+   * Converts a LocalDateTime representing a TimestampNTZ type to an
+   * instance of `java.sql.Timestamp`.
+   */
+  def convertTimestampNTZToJavaTimestamp(ldt: LocalDateTime): Timestamp = {

Review Comment:
   People can use `LocalDateTime.ofEpochSecond`, no need to specify year, 
month, day, ... fields
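   A minimal sketch of that suggestion (illustrative only; the helper name and the
   UTC offset are assumptions for the example, not code from the PR):
   
   ```scala
   import java.time.{LocalDateTime, ZoneOffset}
   
   // Build a LocalDateTime straight from epoch microseconds via ofEpochSecond,
   // without spelling out year/month/day/... fields.
   def microsToLocalDateTime(micros: Long): LocalDateTime = {
     val epochSeconds = Math.floorDiv(micros, 1000000L)
     val nanos = (Math.floorMod(micros, 1000000L) * 1000L).toInt
     LocalDateTime.ofEpochSecond(epochSeconds, nanos, ZoneOffset.UTC)
   }
   ```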



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40678: [SPARK-43040][SQL] Improve TimestampNTZ type support in JDBC data source

2023-04-06 Thread via GitHub


cloud-fan commented on code in PR #40678:
URL: https://github.com/apache/spark/pull/40678#discussion_r1160407528


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala:
##
@@ -379,4 +386,32 @@ class PostgresIntegrationSuite extends 
DockerJDBCIntegrationSuite {
 assert(row(0).length === 1)
 assert(row(0).getString(0) === "$1,000.00")
   }
+
+  test("SPARK-43040: timestamp_ntz read test") {
+val prop = new Properties
+prop.setProperty("preferTimestampNTZ", "true")
+val df = sqlContext.read.jdbc(jdbcUrl, "timestamp_ntz", prop)
+val row = df.collect()

Review Comment:
   ah I see



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40678: [SPARK-43040][SQL] Improve TimestampNTZ type support in JDBC data source

2023-04-06 Thread via GitHub


cloud-fan commented on code in PR #40678:
URL: https://github.com/apache/spark/pull/40678#discussion_r1160407471


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala:
##
@@ -138,6 +139,12 @@ class PostgresIntegrationSuite extends 
DockerJDBCIntegrationSuite {
   "c0 money)").executeUpdate()
 conn.prepareStatement("INSERT INTO money_types VALUES " +
   "('$1,000.00')").executeUpdate()
+
+conn.prepareStatement(s"CREATE TABLE timestamp_ntz(v 
timestamp)").executeUpdate()

Review Comment:
   isn't it the default of `timestamp`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #40678: [SPARK-43040][SQL] Improve TimestampNTZ type support in JDBC data source

2023-04-06 Thread via GitHub


beliefer commented on code in PR #40678:
URL: https://github.com/apache/spark/pull/40678#discussion_r1160406281


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala:
##
@@ -138,6 +139,12 @@ class PostgresIntegrationSuite extends 
DockerJDBCIntegrationSuite {
   "c0 money)").executeUpdate()
 conn.prepareStatement("INSERT INTO money_types VALUES " +
   "('$1,000.00')").executeUpdate()
+
+conn.prepareStatement(s"CREATE TABLE timestamp_ntz(v 
timestamp)").executeUpdate()

Review Comment:
   It seems Postgres has `timestamp without time zone`. I think we can map 
timestamp -> timestamp and timestampNTZ -> timestampNTZ.
   H2 does not have timestampNTZ.
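   A rough sketch of that mapping idea (illustrative only; this is not the dialect
   code from the PR, and the helper name is made up for the example):
   
   ```scala
   import java.sql.Types
   import org.apache.spark.sql.types.{DataType, TimestampNTZType, TimestampType}
   
   // Postgres "timestamp" (no time zone) -> Spark TimestampNTZType,
   // Postgres "timestamptz" -> Spark TimestampType.
   def mapPostgresTimestamp(sqlType: Int, typeName: String): Option[DataType] =
     (sqlType, typeName) match {
       case (Types.TIMESTAMP, "timestamp")   => Some(TimestampNTZType)
       case (Types.TIMESTAMP, "timestamptz") => Some(TimestampType)
       case _ => None
     }
   ```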



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #40415: [Do not merge] Add JDBC to DataFrameWriter

2023-04-06 Thread via GitHub


beliefer commented on PR #40415:
URL: https://github.com/apache/spark/pull/40415#issuecomment-1499887231

   > @beliefer is there a chance you can get this one moving again?
   
   But https://github.com/apache/spark/pull/40358 already does it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yaooqinn commented on pull request #40683: [SPARK-43049][SQL] Use CLOB instead of VARCHAR(255) for StringType for Oracle JDBC

2023-04-06 Thread via GitHub


yaooqinn commented on PR #40683:
URL: https://github.com/apache/spark/pull/40683#issuecomment-1499880602

   thanks, merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng opened a new pull request, #40695: [SPARK-42994][ML][CONNECT] PyTorch Distributor support Local Mode with GPU

2023-04-06 Thread via GitHub


zhengruifeng opened a new pull request, #40695:
URL: https://github.com/apache/spark/pull/40695

   ### What changes were proposed in this pull request?
   PyTorch Distributor support Local Mode with GPU
   
   
   ### Why are the changes needed?
   for functionality parity
   after this PR, all UTs in `test_distributor` are reused in Connect
   
   ### Does this PR introduce _any_ user-facing change?
   yes, new mode supported in Connect
   
   
   ### How was this patch tested?
   enabled UTs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40693: [SPARK-43058] Move Numeric to PhysicalDataType

2023-04-06 Thread via GitHub


hvanhovell commented on code in PR #40693:
URL: https://github.com/apache/spark/pull/40693#discussion_r1160400552


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/PhysicalDataType.scala:
##
@@ -23,7 +23,7 @@ import scala.reflect.runtime.universe.typeTag
 import org.apache.spark.sql.catalyst.expressions.{Ascending, BoundReference, 
InterpretedOrdering, SortOrder}
 import org.apache.spark.sql.catalyst.util.{ArrayData, SQLOrderingUtil}
 import org.apache.spark.sql.errors.QueryExecutionErrors
-import org.apache.spark.sql.types.{ArrayType, BinaryType, BooleanType, 
ByteType, DataType, DateType, DayTimeIntervalType, Decimal, DecimalType, 
DoubleType, FloatType, IntegerType, LongType, MapType, NullType, ShortType, 
StringType, StructField, StructType, TimestampNTZType, TimestampType, 
YearMonthIntervalType}
+import org.apache.spark.sql.types.{ArrayType, BinaryType, BooleanType, 
ByteExactNumeric, ByteType, DataType, DateType, DayTimeIntervalType, Decimal, 
DecimalExactNumeric, DecimalType, DoubleExactNumeric, DoubleType, 
FloatExactNumeric, FloatType, IntegerExactNumeric, IntegerType, 
LongExactNumeric, LongType, MapType, NullType, NumericType, ShortExactNumeric, 
ShortType, StringType, StructField, StructType, TimestampNTZType, 
TimestampType, YearMonthIntervalType}

Review Comment:
   We also have to move `*ExactNumeric` classes at some point.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yaooqinn closed pull request #40683: [SPARK-43049][SQL] Use CLOB instead of VARCHAR(255) for StringType for Oracle JDBC

2023-04-06 Thread via GitHub


yaooqinn closed pull request #40683: [SPARK-43049][SQL] Use CLOB instead of 
VARCHAR(255) for StringType for Oracle JDBC
URL: https://github.com/apache/spark/pull/40683


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yaooqinn commented on a diff in pull request #40678: [SPARK-43040][SQL] Improve TimestampNTZ type support in JDBC data source

2023-04-06 Thread via GitHub


yaooqinn commented on code in PR #40678:
URL: https://github.com/apache/spark/pull/40678#discussion_r1160399361


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala:
##
@@ -138,6 +139,12 @@ class PostgresIntegrationSuite extends 
DockerJDBCIntegrationSuite {
   "c0 money)").executeUpdate()
 conn.prepareStatement("INSERT INTO money_types VALUES " +
   "('$1,000.00')").executeUpdate()
+
+conn.prepareStatement(s"CREATE TABLE timestamp_ntz(v 
timestamp)").executeUpdate()

Review Comment:
   can we also add a column with `timestamp without time zone`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic opened a new pull request, #40694: [SPARK-42057][CONNECT][PYTHON] Migrate Spark Connect Column errors into error class

2023-04-06 Thread via GitHub


itholic opened a new pull request, #40694:
URL: https://github.com/apache/spark/pull/40694

   ### What changes were proposed in this pull request?
   
   This PR proposes to migrate Spark Connect Column errors into error class.
   
   ### Why are the changes needed?
   
   Leveraging PySpark error framework.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, it's error message improvements.
   
   
   ### How was this patch tested?
   
   The existing CI should pass


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia opened a new pull request, #40693: [SPARK-43058] Move Numeric to PhysicalDataType

2023-04-06 Thread via GitHub


amaliujia opened a new pull request, #40693:
URL: https://github.com/apache/spark/pull/40693

   
   
   ### What changes were proposed in this pull request?
   
   This PR proposes that we start to move Numeric to PhysicalDataType. This is 
to simplify the DataType class so that it becomes a simple interface without 
coupling in too many internal representations.
   
   ### Why are the changes needed?
   
   To make DataType a simpler interface, non-public code can be moved 
outside of the DataType class.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   NO
   
   ### How was this patch tested?
   
   UT


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40523: [SPARK-42897][SQL] Avoid evaluate variables multiple times for SMJ and SHJ fullOuter join

2023-04-06 Thread via GitHub


cloud-fan commented on code in PR #40523:
URL: https://github.com/apache/spark/pull/40523#discussion_r1160396781


##
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala:
##
@@ -1036,8 +1013,14 @@ case class SortMergeJoinExec(
 val rightResultVars = genOneSideJoinVars(
   ctx, rightOutputRow, right, setDefaultValue = true)
 val resultVars = leftResultVars ++ rightResultVars
-val (_, conditionCheck, _) =
-  getJoinCondition(ctx, leftResultVars, left, right, Some(rightOutputRow))
+val copiedLeftResultVars = leftResultVars.map(v => v.copy())

Review Comment:
   ditto



##
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala:
##
@@ -1036,8 +1013,14 @@ case class SortMergeJoinExec(
 val rightResultVars = genOneSideJoinVars(
   ctx, rightOutputRow, right, setDefaultValue = true)
 val resultVars = leftResultVars ++ rightResultVars
-val (_, conditionCheck, _) =
-  getJoinCondition(ctx, leftResultVars, left, right, Some(rightOutputRow))
+val copiedLeftResultVars = leftResultVars.map(v => v.copy())
+val (leftBefore, _) = splitVarsByCondition(left.output, 
copiedLeftResultVars)
+val (_, conditionCheckWithoutLeftVars, _) =
+  getJoinCondition(ctx, copiedLeftResultVars, left, right, 
Some(rightOutputRow))
+val conditionCheck =

Review Comment:
   ditto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40523: [SPARK-42897][SQL] Avoid evaluate variables multiple times for SMJ and SHJ fullOuter join

2023-04-06 Thread via GitHub


cloud-fan commented on code in PR #40523:
URL: https://github.com/apache/spark/pull/40523#discussion_r1160396428


##
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##
@@ -364,10 +364,15 @@ case class ShuffledHashJoinExec(
  |${streamedKeyExprCode.code}
""".stripMargin
 val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"
-
+val copiedStreamedVars = streamedVars.map(v => v.copy())
+val (streamedBefore, _) = splitVarsByCondition(streamedOutput, 
copiedStreamedVars)
 // Generate code for join condition
-val (_, conditionCheck, _) =
-  getJoinCondition(ctx, streamedVars, streamedPlan, buildPlan, 
Some(buildRow))
+val (_, conditionCheckWithoutStreamVars, _) =
+  getJoinCondition(ctx, copiedStreamedVars, streamedPlan, buildPlan, 
Some(buildRow))
+val conditionCheck =
+  s"""$streamedBefore

Review Comment:
   nit:
   ```
   s"""
  |...
  |...
  |""". stripMargin
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40523: [SPARK-42897][SQL] Avoid evaluate variables multiple times for SMJ and SHJ fullOuter join

2023-04-06 Thread via GitHub


cloud-fan commented on code in PR #40523:
URL: https://github.com/apache/spark/pull/40523#discussion_r1160396078


##
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##
@@ -364,10 +364,15 @@ case class ShuffledHashJoinExec(
  |${streamedKeyExprCode.code}
""".stripMargin
 val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"
-
+val copiedStreamedVars = streamedVars.map(v => v.copy())

Review Comment:
   Can we add some comments to explain why we copy here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #40684: [SPARK-41532][CONNECT][CLIENT] Add check for operations that involve multiple data frames

2023-04-06 Thread via GitHub


amaliujia commented on code in PR #40684:
URL: https://github.com/apache/spark/pull/40684#discussion_r1160396036


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala:
##
@@ -69,6 +69,9 @@ class SparkSession private[sql] (
 
   private[this] val allocator = new RootAllocator()
 
+  // a unique session ID for this session from client.
+  private[sql] def sessionId: String = client.sessionId

Review Comment:
   
https://github.com/apache/spark/blob/fa6e55bba4d62cdad66e5f425d8a261fe1050134/connector/connect/common/src/main/protobuf/spark/connect/base.proto#L48
 this is the user_id of a client.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40523: [SPARK-42897][SQL] Avoid evaluate variables multiple times for SMJ and SHJ fullOuter join

2023-04-06 Thread via GitHub


cloud-fan commented on code in PR #40523:
URL: https://github.com/apache/spark/pull/40523#discussion_r1160396011


##
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##
@@ -364,10 +364,15 @@ case class ShuffledHashJoinExec(
  |${streamedKeyExprCode.code}
""".stripMargin
 val streamedKeyAnyNull = s"${streamedKeyExprCode.value}.anyNull()"
-
+val copiedStreamedVars = streamedVars.map(v => v.copy())
+val (streamedBefore, _) = splitVarsByCondition(streamedOutput, 
copiedStreamedVars)
 // Generate code for join condition
-val (_, conditionCheck, _) =
-  getJoinCondition(ctx, streamedVars, streamedPlan, buildPlan, 
Some(buildRow))
+val (_, conditionCheckWithoutStreamVars, _) =

Review Comment:
   actually I think the new name is a bit confusing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #40684: [SPARK-41532][CONNECT][CLIENT] Add check for operations that involve multiple data frames

2023-04-06 Thread via GitHub


amaliujia commented on code in PR #40684:
URL: https://github.com/apache/spark/pull/40684#discussion_r1160395678


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala:
##
@@ -69,6 +69,9 @@ class SparkSession private[sql] (
 
   private[this] val allocator = new RootAllocator()
 
+  // a unique session ID for this session from client.
+  private[sql] def sessionId: String = client.sessionId

Review Comment:
   Nit: to be more accurate, this should be `user_id+session_id`. However, because 
session_id itself is a random number, `client.sessionId` should still be able to 
identify the SparkSession uniquely.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40523: [SPARK-42897][SQL] Avoid evaluate variables multiple times for SMJ and SHJ fullOuter join

2023-04-06 Thread via GitHub


cloud-fan commented on code in PR #40523:
URL: https://github.com/apache/spark/pull/40523#discussion_r1160395746


##
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/JoinCodegenSupport.scala:
##
@@ -95,4 +95,27 @@ trait JoinCodegenSupport extends CodegenSupport with 
BaseJoinExec {
   }
 }
   }
+
+  /**
+   * Splits variables based on whether it's used by condition or not, returns 
the code to create

Review Comment:
   I assume this just moves the code without any change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40352: [SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-04-06 Thread via GitHub


hvanhovell commented on code in PR #40352:
URL: https://github.com/apache/spark/pull/40352#discussion_r1160395012


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala:
##
@@ -78,7 +79,7 @@ case class BloomFilterAggregate(
 "exprName" -> "estimatedNumItems or numBits"
   )
 )
-  case (LongType, LongType, LongType) =>
+  case (LongType, LongType, LongType) | (StringType, LongType, LongType) =>

Review Comment:
   Alternatively: `case (LongType | StringType, LongType, LongType) =>`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on pull request #40684: [SPARK-41532][CONNECT][CLIENT] Add check for operations that involve multiple data frames

2023-04-06 Thread via GitHub


amaliujia commented on PR #40684:
URL: https://github.com/apache/spark/pull/40684#issuecomment-1499869942

   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40352: [SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-04-06 Thread via GitHub


hvanhovell commented on code in PR #40352:
URL: https://github.com/apache/spark/pull/40352#discussion_r1160394607


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -1154,6 +1155,91 @@ class SparkConnectPlanner(val session: SparkSession) {
 }
 Some(Lead(children.head, children(1), children(2), ignoreNulls))
 
+  case "bloom_filter_agg" if fun.getArgumentsCount == 5 =>
+// [col, catalogString: String, expectedNumItems: Long, numBits: Long, 
fpp: Double]
+val children = 
fun.getArgumentsList.asScala.toSeq.map(transformExpression)
+val dt = {
+  val ddl = children(1) match {
+case StringLiteral(s) => s
+case other =>
+  throw InvalidPlanInput(s"col dataType should be a literal 
string, but got $other")
+  }
+  DataType.fromDDL(ddl)
+}
+val col = dt match {

Review Comment:
   I wonder if we can solve this by special casing implicit type conversions in 
BloomFilterAggregate instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40352: [SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-04-06 Thread via GitHub


hvanhovell commented on code in PR #40352:
URL: https://github.com/apache/spark/pull/40352#discussion_r1160392895


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -1073,6 +1074,91 @@ class SparkConnectPlanner(val session: SparkSession) {
 }
 Some(Lead(children.head, children(1), children(2), ignoreNulls))
 
+  case "bloom_filter_agg" if fun.getArgumentsCount == 5 =>
+// [col, catalogString: String, expectedNumItems: Long, numBits: Long, 
fpp: Double]
+val children = 
fun.getArgumentsList.asScala.toSeq.map(transformExpression)
+val dt = {
+  val ddl = children(1) match {
+case StringLiteral(s) => s
+case other =>
+  throw InvalidPlanInput(s"col dataType should be a literal 
string, but got $other")
+  }
+  DataType.fromDDL(ddl)
+}
+val col = dt match {
+  case IntegerType | ShortType | ByteType => Cast(children.head, 
LongType)
+  case LongType | StringType => children.head
+  case other =>
+throw InvalidPlanInput(
+  s"Bloom filter only supports integral types, " +
+s"and does not support type $other.")
+}
+
+val fpp = children(4) match {
+  case DoubleLiteral(d) => d
+  case _ =>
+throw InvalidPlanInput("False positive must be double literal.")
+}
+
+if (fpp.isNaN) {
+  // Use expectedNumItems and numBits when `fpp.isNaN` if true.
+  // Check expectedNumItems > 0L
+  val expectedNumItemsExpr = children(2)
+  val expectedNumItems = expectedNumItemsExpr match {
+case Literal(l: Long, LongType) => l
+case _ =>
+  throw InvalidPlanInput("Expected insertions must be long 
literal.")
+  }
+  if (expectedNumItems <= 0L) {
+throw InvalidPlanInput("Expected insertions must be positive.")
+  }
+  // Check numBits > 0L
+  val numBitsExpr = children(3)
+  val numBits = numBitsExpr match {
+case Literal(l: Long, LongType) => l
+case _ =>
+  throw InvalidPlanInput("Number of bits must be long literal.")
+  }
+  if (numBits <= 0L) {
+throw InvalidPlanInput("Number of bits must be positive.")
+  }
+  // Create BloomFilterAggregate with expectedNumItemsExpr and 
numBitsExpr.
+  Some(
+new BloomFilterAggregate(col, expectedNumItemsExpr, numBitsExpr)
+  .toAggregateExpression())
+
+} else {
+  def optimalNumOfBits(n: Long, p: Double): Long =

Review Comment:
   Not public just private[spark].
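   For context, a sketch of the standard Bloom filter sizing math such a helper
   typically implements (well-known formula, not code copied from this PR):
   
   ```scala
   // m = ceil(-n * ln(p) / (ln 2)^2): bits needed for n insertions at
   // false-positive probability p.
   def optimalNumOfBits(expectedNumItems: Long, fpp: Double): Long =
     math.ceil(-expectedNumItems * math.log(fpp) / (math.log(2) * math.log(2))).toLong
   ```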



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40352: [SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-04-06 Thread via GitHub


hvanhovell commented on code in PR #40352:
URL: https://github.com/apache/spark/pull/40352#discussion_r1160392526


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##
@@ -1154,6 +1155,91 @@ class SparkConnectPlanner(val session: SparkSession) {
 }
 Some(Lead(children.head, children(1), children(2), ignoreNulls))
 
+  case "bloom_filter_agg" if fun.getArgumentsCount == 5 =>

Review Comment:
   TBH I think we need to get rid of all this manual function resolution, and 
try to fold it back into the FunctionRegistry.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40352: [SPARK-42664][CONNECT] Support `bloomFilter` function for `DataFrameStatFunctions`

2023-04-06 Thread via GitHub


hvanhovell commented on code in PR #40352:
URL: https://github.com/apache/spark/pull/40352#discussion_r1160389178


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala:
##
@@ -584,6 +585,97 @@ final class DataFrameStatFunctions private[sql] 
(sparkSession: SparkSession, roo
 }
 CountMinSketch.readFrom(ds.head())
   }
+
+  /**
+   * Builds a Bloom filter over a specified column.
+   *
+   * @param colName
+   *   name of the column over which the filter is built
+   * @param expectedNumItems
+   *   expected number of items which will be put into the filter.
+   * @param fpp
+   *   expected false positive probability of the filter.
+   * @since 3.4.0
+   */
+  def bloomFilter(colName: String, expectedNumItems: Long, fpp: Double): 
BloomFilter = {
+buildBloomFilter(Column(colName), expectedNumItems, -1L, fpp)
+  }
+
+  /**
+   * Builds a Bloom filter over a specified column.
+   *
+   * @param col
+   *   the column over which the filter is built
+   * @param expectedNumItems
+   *   expected number of items which will be put into the filter.
+   * @param fpp
+   *   expected false positive probability of the filter.
+   * @since 3.4.0
+   */
+  def bloomFilter(col: Column, expectedNumItems: Long, fpp: Double): 
BloomFilter = {
+buildBloomFilter(col, expectedNumItems, -1L, fpp)
+  }
+
+  /**
+   * Builds a Bloom filter over a specified column.
+   *
+   * @param colName
+   *   name of the column over which the filter is built
+   * @param expectedNumItems
+   *   expected number of items which will be put into the filter.
+   * @param numBits
+   *   expected number of bits of the filter.
+   * @since 3.4.0
+   */
+  def bloomFilter(colName: String, expectedNumItems: Long, numBits: Long): 
BloomFilter = {
+buildBloomFilter(Column(colName), expectedNumItems, numBits, Double.NaN)
+  }
+
+  /**
+   * Builds a Bloom filter over a specified column.
+   *
+   * @param col
+   *   the column over which the filter is built
+   * @param expectedNumItems
+   *   expected number of items which will be put into the filter.
+   * @param numBits
+   *   expected number of bits of the filter.
+   * @since 3.4.0
+   */
+  def bloomFilter(col: Column, expectedNumItems: Long, numBits: Long): 
BloomFilter = {
+buildBloomFilter(col, expectedNumItems, numBits, Double.NaN)
+  }
+
+  private def buildBloomFilter(
+  col: Column,
+  expectedNumItems: Long,
+  numBits: Long,
+  fpp: Double): BloomFilter = {
+
+val dataType = sparkSession

Review Comment:
   Can we use the `TypeOf` expression for this instead? Alternatively we can 
try to figure this out in the planner.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Yikf commented on pull request #40437: [SPARK-41259][SQL] SparkSQLDriver Output schema and result string should be consistent

2023-04-06 Thread via GitHub


Yikf commented on PR #40437:
URL: https://github.com/apache/spark/pull/40437#issuecomment-1499861262

   > Seems like we just need a small change: only invoke `hiveResultString` if 
a testing config is true, and we only set that testing config in 
`HiveComparisonTest`.
   
   `hiveResultString`, which returns the result as a Hive-compatible sequence of 
strings, currently does two things:
   1. `HiveComparisonTest` uses it to compare compatibility with Hive output;
   2. Spark uses it to generate golden files for `SQLQueryTestSuite`;
   
   2 is a friendly form of output, and I think we should keep it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on pull request #40415: [Do not merge] Add JDBC to DataFrameWriter

2023-04-06 Thread via GitHub


hvanhovell commented on PR #40415:
URL: https://github.com/apache/spark/pull/40415#issuecomment-1499858366

   @beliefer is there a chance you can get this one moving again?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40692: [SPARK-43055][CONNECT][PYTHON] Support duplicated nested field names

2023-04-06 Thread via GitHub


hvanhovell commented on code in PR #40692:
URL: https://github.com/apache/spark/pull/40692#discussion_r1160385446


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/SparkResult.scala:
##
@@ -60,13 +61,19 @@ private[sql] class SparkResult[T](
   private def processResponses(stopOnFirstNonEmptyResponse: Boolean): Boolean 
= {
 while (responses.hasNext) {
   val response = responses.next()
+  if (response.hasSchema) {
+structType =

Review Comment:
   What is the difference between this schema and the one in the arrow batch?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40654: [SPARK-43022][CONNECT] Support protobuf functions for Scala client

2023-04-06 Thread via GitHub


hvanhovell commented on code in PR #40654:
URL: https://github.com/apache/spark/pull/40654#discussion_r1160384178


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala:
##
@@ -863,6 +866,46 @@ class ClientE2ETestSuite extends RemoteSparkSession with 
SQLHelper {
 }.getMessage
 assert(message.contains("PARSE_SYNTAX_ERROR"))
   }
+
+  test("protobuf functions") {
+assert(IntegrationTestUtils.isSparkProtobufJarAvailable)
+// scalastyle:off line.size.limit
+// If `common.desc` needs to be updated, execute the following command to 
regenerate it:
+//  1. cd connector/connect/common/src/main/protobuf/spark/connect
+//  2. protoc --include_imports 
--descriptor_set_out=../../../../test/resources/protobuf-tests/common.desc 
common.proto
+// scalastyle:on line.size.limit
+val descPath = {

Review Comment:
   Where is that absolute file path coming from?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40654: [SPARK-43022][CONNECT] Support protobuf functions for Scala client

2023-04-06 Thread via GitHub


hvanhovell commented on code in PR #40654:
URL: https://github.com/apache/spark/pull/40654#discussion_r1160383595


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/util/IntegrationTestUtils.scala:
##
@@ -57,6 +57,16 @@ object IntegrationTestUtils {
 Files.exists(Paths.get(filePath))
   }
 
+  private[sql] lazy val protobufJarPath: Option[File] = {
+try {
+  Some(findJar("connector/protobuf", "spark-protobuf", "spark-protobuf"))
+} catch {
+  case _: AssertionError => None

Review Comment:
   What is throwing the assertion error?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wankunde commented on a diff in pull request #40523: [SPARK-42897][SQL] Avoid evaluate variables multiple times for SMJ and SHJ fullOuter join

2023-04-06 Thread via GitHub


wankunde commented on code in PR #40523:
URL: https://github.com/apache/spark/pull/40523#discussion_r1160381740


##
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala:
##
@@ -1036,8 +1036,17 @@ case class SortMergeJoinExec(
 val rightResultVars = genOneSideJoinVars(
   ctx, rightOutputRow, right, setDefaultValue = true)
 val resultVars = leftResultVars ++ rightResultVars
-val (_, conditionCheck, _) =
-  getJoinCondition(ctx, leftResultVars, left, right, Some(rightOutputRow))
+// Evaluate the variables on the left and used in the condition but do not 
clear the code as

Review Comment:
   Yes, you are right, I updated the code and it is much clearer than before. 
   Thanks for your help.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40605: [SPARK-42958][CONNECT] Refactor `connect-jvm-client-mima-check` to support mima check with avro module

2023-04-06 Thread via GitHub


hvanhovell commented on code in PR #40605:
URL: https://github.com/apache/spark/pull/40605#discussion_r1160381455


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CheckConnectJvmClientCompatibility.scala:
##
@@ -62,15 +62,29 @@ object CheckConnectJvmClientCompatibility {
   "spark-connect-client-jvm-assembly",
   "spark-connect-client-jvm")
   val sqlJar: File = findJar("sql/core", "spark-sql", "spark-sql")
-  val problems = checkMiMaCompatibility(clientJar, sqlJar)
-  if (problems.nonEmpty) {
+  val problemsWithSqlModule = 
checkMiMaCompatibilityWithSqlModule(clientJar, sqlJar)
+  if (problemsWithSqlModule.nonEmpty) {
 resultWriter.write(s"ERROR: Comparing client jar: $clientJar and and 
sql jar: $sqlJar \n")
-resultWriter.write(s"problems: \n")
-resultWriter.write(s"${problems.map(p => 
p.description("client")).mkString("\n")}")
+resultWriter.write(s"problemsWithSqlModule: \n")

Review Comment:
   Let's add a helper function for the problem logging. Also use multiline 
strings there, they tend to be easier to read.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Hisoka-X commented on a diff in pull request #40684: [SPARK-41532][CONNECT][CLIENT] Add check for operations that involve multiple data frames

2023-04-06 Thread via GitHub


Hisoka-X commented on code in PR #40684:
URL: https://github.com/apache/spark/pull/40684#discussion_r1160381245


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/util/RemoteSparkSession.scala:
##
@@ -49,7 +49,7 @@ import org.apache.spark.util.Utils
 object SparkConnectServerUtils {
 
   // Server port
-  private[connect] val port = ConnectCommon.CONNECT_GRPC_BINDING_PORT + 
util.Random.nextInt(1000)
+  val port: Int = ConnectCommon.CONNECT_GRPC_BINDING_PORT + 
util.Random.nextInt(1000)

Review Comment:
   > You could use `private[spark] val port: Int` accessor?
   
   Thanks for the reminder, I changed it.



##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala:
##
@@ -69,6 +69,9 @@ class SparkSession private[sql] (
 
   private[this] val allocator = new RootAllocator()
 
+  // a unique session ID for this session from client.
+  private[sql] lazy val sessionId: String = client.sessionId

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40676: [SPARK-42656][FOLLOWUP] Add BUILD and SCCLASSPATH options to Spark Connect scripts

2023-04-06 Thread via GitHub


HyukjinKwon commented on code in PR #40676:
URL: https://github.com/apache/spark/pull/40676#discussion_r1160381167


##
connector/connect/bin/spark-connect:
##
@@ -30,8 +30,11 @@ export SPARK_HOME=$FWDIR
 SCALA_BINARY_VER=`grep "scala.binary.version" "${SPARK_HOME}/pom.xml" | head 
-n1 | awk -F '[<>]' '{print $3}'`
 SCALA_ARG="-Pscala-${SCALA_BINARY_VER}"
 
-# Build the jars needed for spark submit and spark connect
-build/sbt "${SCALA_ARG}" -Phive -Pconnect package
+BUILD="${BUILD:-1}"

Review Comment:
   Sorry for the post-hoc review. I wonder if we should pick a different 
name though. `BUILD` might be too common ... and that might easily conflict.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40605: [SPARK-42958][CONNECT] Refactor `connect-jvm-client-mima-check` to support mima check with avro module

2023-04-06 Thread via GitHub


hvanhovell commented on code in PR #40605:
URL: https://github.com/apache/spark/pull/40605#discussion_r1160380890


##
dev/connect-jvm-client-mima-check:
##
@@ -34,20 +34,18 @@ fi
 
 rm -f .connect-mima-check-result
 
-echo "Build sql module, connect-client-jvm module and connect-client-jvm test 
jar..."
-build/sbt 
"sql/package;connect-client-jvm/assembly;connect-client-jvm/Test/package"
+echo "Build required modules..."
+build/sbt "assembly/package;connect-client-jvm/Test/package"
 
+ASSEMBLY_CLASSPATH="$(build/sbt -DcopyDependencies=false "export 
assembly/fullClasspath" | grep jar | tail -n1)"
 CONNECT_TEST_CLASSPATH="$(build/sbt -DcopyDependencies=false "export 
connect-client-jvm/Test/fullClasspath" | grep jar | tail -n1)"
-CONNECT_CLASSPATH="$(build/sbt -DcopyDependencies=false "export 
connect-client-jvm/fullClasspath" | grep jar | tail -n1)"
-SQL_CLASSPATH="$(build/sbt -DcopyDependencies=false "export sql/fullClasspath" 
| grep jar | tail -n1)"
-
 
 echo "Do connect-client-jvm module mima check ..."
 
 $JAVA_CMD \
   -Xmx2g \
   -XX:+IgnoreUnrecognizedVMOptions 
--add-opens=java.base/java.util.jar=ALL-UNNAMED \
-  -cp "$CONNECT_CLASSPATH:$CONNECT_TEST_CLASSPATH:$SQL_CLASSPATH" \
+  -cp "$CONNECT_TEST_CLASSPATH:$ASSEMBLY_CLASSPATH" \

Review Comment:
   Why do these jars need to be on the class path? It seems that we are reading 
the jars directly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] tianhanhu-db commented on a diff in pull request #40678: [SPARK-43040][SQL] Improve TimestampNTZ type support in JDBC data source

2023-04-06 Thread via GitHub


tianhanhu-db commented on code in PR #40678:
URL: https://github.com/apache/spark/pull/40678#discussion_r1160380014


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala:
##
@@ -379,4 +386,32 @@ class PostgresIntegrationSuite extends 
DockerJDBCIntegrationSuite {
 assert(row(0).length === 1)
 assert(row(0).getString(0) === "$1,000.00")
   }
+
+  test("SPARK-43040: timestamp_ntz read test") {
+val prop = new Properties
+prop.setProperty("preferTimestampNTZ", "true")
+val df = sqlContext.read.jdbc(jdbcUrl, "timestamp_ntz", prop)
+val row = df.collect()

Review Comment:
   Do you mean `spark.sql.datetime.java8API.enabled`? This only controls 
`TimestampType` but not `TimestampNTZType` if I am not mistaken.
   `TimestampNTZType` always converts to `LocalDateTime`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin commented on pull request #40692: [SPARK-43055][CONNECT][PYTHON] Support duplicated nested field names

2023-04-06 Thread via GitHub


ueshin commented on PR #40692:
URL: https://github.com/apache/spark/pull/40692#issuecomment-1499840711

   > Just FYI, vanilla PySpark's DataFrame.toPandas also has this issue 
[issues.apache.org/jira/browse/SPARK-41971](https://issues.apache.org/jira/browse/SPARK-41971)
   Is it possible to move the changes to ArrowUtils to fix them all?
   
   Yes, I'm aware of the issue, but let me hold off on it until the follow-up PRs.
   
   TLDR;
   
   Actually this PR still has an issue with `toPandas`.
   
   ```py
   >>> spark.sql("values (1, struct(1 as a, 2 as a)) as t(x, y)").toPandas()
  x y
   0  1  {'a_0': 1, 'a_1': 2}
   ```
   
   The duplicated fields get suffixes `_0`, `_1`, and so on, as shown above.
   
   Also, handling of struct types in `toPandas` was not well-defined, and there 
are behavior differences even between Arrow enabled/disabled in PySpark.
   
   ```py
   >>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', False)
   >>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
  x   y
   0  1  (1, 2)
   >>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
   >>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
  x y
   0  1  {'a': 1, 'b': 2}
   ```
   
   Currently PySpark with Arrow enabled, and Spark Connect, use a map for the 
struct type object in the result, whereas PySpark without Arrow uses a `Row` 
object.
   
   The options are:
   
   1. It's OK to be different, and to keep the suffix.
   - In this case, the suffix is a must because a map object will hold only 
one value for the duplicated keys.
   2. A `Row` object should be used for the struct.
   - In this case, we will lose the benefit of the fast Arrow -> pandas 
conversion.
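   
   A minimal sketch of the suffixing behavior described above, assuming a 
standalone helper rather than the actual ArrowUtils / Spark Connect code:
   
   ```scala
   // Hypothetical helper: only names that occur more than once get a suffix,
   // so unique fields keep their original names.
   def dedupFieldNames(names: Seq[String]): Seq[String] = {
     val counts = names.groupBy(identity).map { case (n, vs) => n -> vs.size }
     val seen = scala.collection.mutable.Map.empty[String, Int]
     names.map { name =>
       if (counts(name) > 1) {
         val idx = seen.getOrElse(name, 0)
         seen(name) = idx + 1
         s"${name}_$idx"   // 'a' -> 'a_0', 'a_1', ...
       } else {
         name
       }
     }
   }

   // dedupFieldNames(Seq("x", "a", "a")) == Seq("x", "a_0", "a_1")
   ```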


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] tianhanhu-db commented on a diff in pull request #40678: [SPARK-43040][SQL] Improve TimestampNTZ type support in JDBC data source

2023-04-06 Thread via GitHub


tianhanhu-db commented on code in PR #40678:
URL: https://github.com/apache/spark/pull/40678#discussion_r1160376168


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala:
##
@@ -98,6 +100,14 @@ private object PostgresDialect extends JdbcDialect with 
SQLConfHelper {
 case _ => None
   }
 
+  override def convertJavaTimestampToTimestampNTZ(t: Timestamp): Long = {
+DateTimeUtils.localDateTimeToMicros(t.toLocalDateTime)

Review Comment:
   I tried the Postgres-specific solution for the existing H2 test and it does 
not work.
   
   I checked the H2 driver as well, and I think what happens is that H2 
creates the timestamp from the milliseconds since the epoch and THEN converts 
the wall clock time to represent that instant in the local time zone. This 
change in order makes the difference. If we take the previous case as an 
example, the resulting Timestamp would be "2023-04-05 01:00:00 
America/Los_Angeles". This represents the same instant as "2023-04-05 08:00:00 
UTC", which is why storing its microseconds from epoch works. It also explains 
why converting to LocalDateTime (the Postgres-specific solution) would not 
work.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell closed pull request #40656: [SPARK-43023][CONNECT][TESTS] Add switch catalog testing scenario for `CatalogSuite`

2023-04-06 Thread via GitHub


hvanhovell closed pull request #40656: [SPARK-43023][CONNECT][TESTS] Add switch 
catalog testing scenario for `CatalogSuite`
URL: https://github.com/apache/spark/pull/40656


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] tianhanhu-db commented on a diff in pull request #40678: [SPARK-43040][SQL] Improve TimestampNTZ type support in JDBC data source

2023-04-06 Thread via GitHub


tianhanhu-db commented on code in PR #40678:
URL: https://github.com/apache/spark/pull/40678#discussion_r1160377729


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala:
##
@@ -104,6 +105,23 @@ abstract class JdbcDialect extends Serializable with 
Logging {
*/
   def getJDBCType(dt: DataType): Option[JdbcType] = None
 
+  /**
+   * Convert java.sql.Timestamp to a Long value (internal representation of a 
TimestampNTZType)
+   * holding the microseconds since the epoch of 1970-01-01 00:00:00Z for this 
timestamp.
+   */
+  def convertJavaTimestampToTimestampNTZ(t: Timestamp): Long = {
+DateTimeUtils.fromJavaTimestampNoRebase(t)
+  }
+
+  /**
+   * Converts a LocalDateTime representing a TimestampNTZ type to an
+   * instance of `java.sql.Timestamp`.
+   */
+  def convertTimestampNTZToJavaTimestamp(ldt: LocalDateTime): Timestamp = {

Review Comment:
   It would be hard to make them symmetric, actually.
   For example, the neutral conversion solution here stores the Java 
Timestamp's underlying microseconds from the epoch directly into the long 
value. It would be wrong to construct a `LocalDateTime` from the year, day, 
..., minute, second fields of the Timestamp.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #40678: [SPARK-43040][SQL] Improve TimestampNTZ type support in JDBC data source

2023-04-06 Thread via GitHub


beliefer commented on code in PR #40678:
URL: https://github.com/apache/spark/pull/40678#discussion_r1160377410


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala:
##
@@ -98,6 +100,14 @@ private object PostgresDialect extends JdbcDialect with 
SQLConfHelper {
 case _ => None
   }
 
+  override def convertJavaTimestampToTimestampNTZ(t: Timestamp): Long = {
+DateTimeUtils.localDateTimeToMicros(t.toLocalDateTime)

Review Comment:
   Does Postgres have TimestampNTZ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] tianhanhu-db commented on a diff in pull request #40678: [SPARK-43040][SQL] Improve TimestampNTZ type support in JDBC data source

2023-04-06 Thread via GitHub


tianhanhu-db commented on code in PR #40678:
URL: https://github.com/apache/spark/pull/40678#discussion_r1160376444


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala:
##
@@ -98,6 +100,14 @@ private object PostgresDialect extends JdbcDialect with 
SQLConfHelper {
 case _ => None
   }
 
+  override def convertJavaTimestampToTimestampNTZ(t: Timestamp): Long = {
+DateTimeUtils.localDateTimeToMicros(t.toLocalDateTime)

Review Comment:
   To conclude, JDBC drivers have different expected behaviors in regard to 
implementing "getTimestamp" and "setTimestamp". A general conversion strategy 
would not work for all of them.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yaooqinn commented on pull request #40683: [SPARK-43049][SQL] Use CLOB instead of VARCHAR(255) for StringType for Oracle JDBC

2023-04-06 Thread via GitHub


yaooqinn commented on PR #40683:
URL: https://github.com/apache/spark/pull/40683#issuecomment-1499832054

   cc @dongjoon-hyun @cloud-fan thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] tianhanhu-db commented on a diff in pull request #40678: [SPARK-43040][SQL] Improve TimestampNTZ type support in JDBC data source

2023-04-06 Thread via GitHub


tianhanhu-db commented on code in PR #40678:
URL: https://github.com/apache/spark/pull/40678#discussion_r1160376168


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala:
##
@@ -98,6 +100,14 @@ private object PostgresDialect extends JdbcDialect with 
SQLConfHelper {
 case _ => None
   }
 
+  override def convertJavaTimestampToTimestampNTZ(t: Timestamp): Long = {
+DateTimeUtils.localDateTimeToMicros(t.toLocalDateTime)

Review Comment:
   I tried the Postgres-specific solution for the existing H2 test and it does 
not work.
   
   I checked the H2 driver as well, and I think what happens is that H2 
creates the timestamp from the milliseconds since the epoch and THEN converts 
the wall clock time to the local time zone. This change in order makes the 
difference. If we take the previous case as an example, the resulting 
Timestamp would be "2023-04-05 01:00:00 America/Los_Angeles". This represents 
the same instant as "2023-04-05 08:00:00 UTC", which is why storing its 
microseconds from epoch works. It also explains why converting to 
LocalDateTime (the Postgres-specific solution) would not work.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell closed pull request #40651: [SPARK-43019][SQL] Move Ordering to PhysicalDataType

2023-04-06 Thread via GitHub


hvanhovell closed pull request #40651: [SPARK-43019][SQL] Move Ordering to 
PhysicalDataType
URL: https://github.com/apache/spark/pull/40651


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40651: [SPARK-43019][SQL] Move Ordering to PhysicalDataType

2023-04-06 Thread via GitHub


hvanhovell commented on code in PR #40651:
URL: https://github.com/apache/spark/pull/40651#discussion_r1160375867


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ordering.scala:
##
@@ -37,6 +37,7 @@ class BaseOrdering extends Ordering[InternalRow] {
  * An interpreted row ordering comparator.
  */
 class InterpretedOrdering(ordering: Seq[SortOrder]) extends BaseOrdering {
+  private lazy val physicalDataTypes = ordering.map(order => 
PhysicalDataType(order.dataType))

Review Comment:
   try to avoid lazy vals on a hot path.
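   
   A tiny standalone illustration of the concern (not the Spark code): each 
read of a lazy val re-checks its initialization, so state that is touched on 
every `compare` call is better precomputed into a plain val:
   
   ```scala
   class Example(keys: Seq[String]) {
     // Hot-path friendly: computed once at construction, plain field reads after.
     private val eager: Array[Int] = keys.map(_.length).toArray

     // Avoided on hot paths: each access pays the initialization check.
     private lazy val deferred: Array[Int] = keys.map(_.length).toArray

     def firstLen: Int = eager(0)
   }
   ```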



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40651: [SPARK-43019][SQL] Move Ordering to PhysicalDataType

2023-04-06 Thread via GitHub


hvanhovell commented on code in PR #40651:
URL: https://github.com/apache/spark/pull/40651#discussion_r1160375748


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ordering.scala:
##
@@ -56,21 +57,12 @@ class InterpretedOrdering(ordering: Seq[SortOrder]) extends 
BaseOrdering {
   } else if (right == null) {
 return if (order.nullOrdering == NullsFirst) 1 else -1
   } else {
+val orderingFunc = 
physicalDataTypes(i).ordering.asInstanceOf[Ordering[Any]]
 val comparison = order.dataType match {
-  case dt: AtomicType if order.direction == Ascending =>
-dt.ordering.asInstanceOf[Ordering[Any]].compare(left, right)
-  case dt: AtomicType if order.direction == Descending =>
-- dt.ordering.asInstanceOf[Ordering[Any]].compare(left, right)
-  case a: ArrayType if order.direction == Ascending =>
-a.interpretedOrdering.asInstanceOf[Ordering[Any]].compare(left, 
right)
-  case a: ArrayType if order.direction == Descending =>
-- a.interpretedOrdering.asInstanceOf[Ordering[Any]].compare(left, 
right)
-  case s: StructType if order.direction == Ascending =>
-s.interpretedOrdering.asInstanceOf[Ordering[Any]].compare(left, 
right)
-  case s: StructType if order.direction == Descending =>
-- s.interpretedOrdering.asInstanceOf[Ordering[Any]].compare(left, 
right)
-  case other =>
-throw 
QueryExecutionErrors.orderedOperationUnsupportedByDataTypeError(other)
+  case _ if order.direction == Ascending =>

Review Comment:
   You can move ascending/descending into the lazy val as well.
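   
   A rough sketch of that shape (assumptions only, not the actual 
`InterpretedOrdering` code): reverse the ordering for descending keys when 
precomputing, so the compare loop no longer matches on the direction. Null 
ordering is omitted for brevity.
   
   ```scala
   import scala.math.Ordering

   final case class Key(ordering: Ordering[Any], descending: Boolean)

   // Direction is folded into the precomputed comparators: descending keys
   // simply use the reversed ordering, so compare() is a plain loop.
   class PrecomputedOrdering(keys: Seq[Key]) extends Ordering[Array[Any]] {
     private val orderings: Array[Ordering[Any]] =
       keys.map(k => if (k.descending) k.ordering.reverse else k.ordering).toArray

     override def compare(a: Array[Any], b: Array[Any]): Int = {
       var i = 0
       while (i < orderings.length) {
         val c = orderings(i).compare(a(i), b(i))
         if (c != 0) return c
         i += 1
       }
       0
     }
   }
   ```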



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40651: [SPARK-43019][SQL] Move Ordering to PhysicalDataType

2023-04-06 Thread via GitHub


hvanhovell commented on code in PR #40651:
URL: https://github.com/apache/spark/pull/40651#discussion_r1160375619


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ordering.scala:
##
@@ -56,21 +57,12 @@ class InterpretedOrdering(ordering: Seq[SortOrder]) extends 
BaseOrdering {
   } else if (right == null) {
 return if (order.nullOrdering == NullsFirst) 1 else -1
   } else {
+val orderingFunc = 
physicalDataTypes(i).ordering.asInstanceOf[Ordering[Any]]
 val comparison = order.dataType match {

Review Comment:
   You don't need this match anymore.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on pull request #40651: [SPARK-43019][SQL] Move Ordering to PhysicalDataType

2023-04-06 Thread via GitHub


hvanhovell commented on PR #40651:
URL: https://github.com/apache/spark/pull/40651#issuecomment-1499830325

   Merging. Please address remaining comments (if any) in a follow-up


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] tianhanhu-db commented on a diff in pull request #40678: [SPARK-43040][SQL] Improve TimestampNTZ type support in JDBC data source

2023-04-06 Thread via GitHub


tianhanhu-db commented on code in PR #40678:
URL: https://github.com/apache/spark/pull/40678#discussion_r1160368163


##
sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala:
##
@@ -98,6 +100,14 @@ private object PostgresDialect extends JdbcDialect with 
SQLConfHelper {
 case _ => None
   }
 
+  override def convertJavaTimestampToTimestampNTZ(t: Timestamp): Long = {
+DateTimeUtils.localDateTimeToMicros(t.toLocalDateTime)

Review Comment:
   I will give a concrete Postgres read example where the general 
implementation would fail.
   
   Say there is a Timestamp of "2023-04-05 08:00:00" stored in a Postgres 
database and we want to read it as a Spark TimestampNTZType from a time zone 
of America/Los_Angeles. The expected result would be "2023-04-05 08:00:00".
   
   When we do `PostgresDriver.getTimestamp`, what happens under the hood is 
that Postgres uses the default JVM time zone and creates a Timestamp 
representing that wall clock time as an instant in that time zone. Thus, the 
Java Timestamp effectively represents "2023-04-05 08:00:00 
America/Los_Angeles". 
   
   With our general conversion, we would just store the underlying 
microseconds from the epoch to represent the TimestampNTZType. This is 
problematic because when displaying the TimestampNTZType, we convert to a 
LocalDateTime using UTC as the time zone. This gives us an erroneous result of 
"2023-04-05 15:00:00".
   
   The Postgres-specific conversion first converts the Java Timestamp to a 
LocalDateTime before getting its underlying microseconds from the epoch. This 
basically restores the Timestamp to represent "2023-04-05 08:00:00 UTC", so 
when converting back we get the correct result.
   
   For writes, it is a similar story. @cloud-fan @beliefer 
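   
   A small self-contained sketch of the difference using plain `java.sql` / 
`java.time` (not the Spark code), assuming the JVM default time zone is 
America/Los_Angeles:
   
   ```scala
   import java.sql.Timestamp
   import java.time.{LocalDateTime, ZoneOffset}

   // What the driver effectively hands back for a stored "2023-04-05 08:00:00":
   // a Timestamp pinned to the JVM default time zone.
   val ts = Timestamp.valueOf("2023-04-05 08:00:00")

   // "General" conversion: keep the instant and later render it in UTC.
   println(LocalDateTime.ofInstant(ts.toInstant, ZoneOffset.UTC))
   // -> 2023-04-05T15:00 (wrong wall clock)

   // Postgres-specific conversion: recover the wall clock via toLocalDateTime.
   println(ts.toLocalDateTime)
   // -> 2023-04-05T08:00 (matches the stored value)
   ```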



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell closed pull request #40676: [SPARK-42656][FOLLOWUP] Add BUILD and SCCLASSPATH options to Spark Connect scripts

2023-04-06 Thread via GitHub


hvanhovell closed pull request #40676: [SPARK-42656][FOLLOWUP] Add BUILD and 
SCCLASSPATH options to Spark Connect scripts
URL: https://github.com/apache/spark/pull/40676


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


