date:20231108

Re: [PR] [SPARK-45814][CONNECT][SQL]Make ArrowConverters.createEmptyArrowBatch call close() to avoid memory leak [spark]

2023-11-08 Thread via GitHub



LuciferYang commented on PR #43691:
URL: https://github.com/apache/spark/pull/43691#issuecomment-1803327334

   Merged into master and branch-3.5. Thanks @xieshuaihu @hvanhovell 
@dongjoon-hyun @cfmcgrady 
   
   This patch conflicts with branch-3.4. Could you submit a separate pr for 
branch-3.4? @xieshuaihu 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45814][CONNECT][SQL]Make ArrowConverters.createEmptyArrowBatch call close() to avoid memory leak [spark]

2023-11-08 Thread via GitHub



LuciferYang closed pull request #43691: [SPARK-45814][CONNECT][SQL]Make 
ArrowConverters.createEmptyArrowBatch call close() to avoid memory leak
URL: https://github.com/apache/spark/pull/43691


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45562][SQL] XML: Add SQL error class for missing rowTag option [spark]

2023-11-08 Thread via GitHub



MaxGekk commented on code in PR #43710:
URL: https://github.com/apache/spark/pull/43710#discussion_r1387600755


##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -3899,6 +3899,12 @@
 },
 "sqlState" : "42605"
   },
+  "XML_ROW_TAG_OPTION_REQUIRED" : {

Review Comment:
   Yep, I agree. The error class name should point out a problem, possible 
solutions should be in `message`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45814][CONNECT][SQL]Make ArrowConverters.createEmptyArrowBatch call close() to avoid memory leak [spark]

2023-11-08 Thread via GitHub



xieshuaihu commented on PR #43691:
URL: https://github.com/apache/spark/pull/43691#issuecomment-1803309496

   @LuciferYang all ut passed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45826][SQL] Add a SQL config for extra traces in `Origin` [spark]

2023-11-08 Thread via GitHub



MaxGekk commented on code in PR #43695:
URL: https://github.com/apache/spark/pull/43695#discussion_r1387598312


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -4531,6 +4531,14 @@ object SQLConf {
   .booleanConf
   .createWithDefault(false)
 
+  val EXTRA_ORIGIN_TRACES = buildConf("spark.sql.extraOriginTraces")
+.doc("The number of additional non-Spark SQL traces in the captured 
DataFrame context. " +
+  "When it is set to 0, captured one Spark traces and a followed non-Spark 
trace.")
+.version("4.0.0")
+.intConf
+.checkValue(_ >= 0, "The number of extra thread traces must be 
non-negative.")

Review Comment:
   > Intuitively, I feel that 0 should represent the absence of non-Spark trace.
   
   Actually, it works in this way. Let me modify the config and PR description. 
The `slice` method excludes the `until` index. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45770][SQL][PYTHON][CONNECT] Introduce logical plan `UnresolvedDropColumns` for `Dataframe.drop` [spark]

2023-11-08 Thread via GitHub



cloud-fan commented on code in PR #43683:
URL: https://github.com/apache/spark/pull/43683#discussion_r1387588910


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDataFrameDropColumns.scala:
##
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.plans.logical.{DataFrameDropColumns, 
LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.DF_DROP_COLUMNS
+import org.apache.spark.sql.connector.catalog.CatalogManager
+
+/**
+ * A rule that rewrites DropColumns to Project.
+ */
+class ResolveDataFrameDropColumns(val catalogManager: CatalogManager)
+  extends Rule[LogicalPlan] with ColumnResolutionHelper  {
+
+  override def apply(plan: LogicalPlan): LogicalPlan = 
plan.resolveOperatorsWithPruning(
+_.containsPattern(DF_DROP_COLUMNS)) {
+case d: DataFrameDropColumns if d.childrenResolved =>
+  val dropped = d.dropList.map {
+case u: UnresolvedAttribute =>

Review Comment:
   is this for `df1.drop(df2.col)`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45752][SQL] Unreferenced CTE should all be checked by CheckAnalysis0 [spark]

2023-11-08 Thread via GitHub



cloud-fan closed pull request #43614: [SPARK-45752][SQL] Unreferenced CTE 
should all be checked by CheckAnalysis0
URL: https://github.com/apache/spark/pull/43614


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45752][SQL] Unreferenced CTE should all be checked by CheckAnalysis0 [spark]

2023-11-08 Thread via GitHub



cloud-fan commented on PR #43614:
URL: https://github.com/apache/spark/pull/43614#issuecomment-1803285997

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45827] Add Variant data type in Spark. [spark]

2023-11-08 Thread via GitHub



cloud-fan commented on code in PR #43707:
URL: https://github.com/apache/spark/pull/43707#discussion_r1387580155


##
common/unsafe/src/main/java/org/apache/spark/unsafe/types/VariantVal.java:
##
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.unsafe.types;
+
+import org.apache.spark.unsafe.Platform;
+
+import java.io.Serializable;
+import java.util.Arrays;
+
+/**
+ * The physical data representation of {@link 
org.apache.spark.sql.types.VariantType} that
+ * represents a semi-structured value. It consists of two binary values: 
{@link VariantVal#value}
+ * and {@link VariantVal#metadata}. The value encodes types and values, but 
not field names. The
+ * metadata currently contains a version flag and a list of field names. We 
can extend/modify the
+ * detailed binary format given the version flag.
+ * 
+ * A {@link VariantVal} can be produced by casting another value into the 
Variant type or parsing a
+ * JSON string in the {@link 
org.apache.spark.sql.catalyst.expressions.variant.ParseJson}
+ * expression. We can extract a path consisting of field names and array 
indices from it, cast it
+ * into a concrete data type, or rebuild a JSON string from it.
+ * 
+ * The storage layout of this class in {@link 
org.apache.spark.sql.catalyst.expressions.UnsafeRow}
+ * and {@link org.apache.spark.sql.catalyst.expressions.UnsafeArrayData} is: 
the fixed-size part is
+ * a long value "offsetAndSize". The upper 32 bits is the offset that points 
to the start position
+ * of the actual binary content. The lower 32 bits is the total length of the 
binary content. The
+ * binary content contains: 4 bytes representing the length of {@link 
VariantVal#value}, content of
+ * {@link VariantVal#value}, content of {@link VariantVal#metadata}. This is 
an internal and
+ * transient format and can be modified at any time.
+ */
+public class VariantVal implements Serializable {
+  protected final byte[] value;
+  protected final byte[] metadata;
+
+  public VariantVal(byte[] value, byte[] metadata) {
+this.value = value;
+this.metadata = metadata;
+  }
+
+  public byte[] getValue() {
+return value;
+  }
+
+  public byte[] getMetadata() {
+return metadata;
+  }
+
+  /**
+   * This function writes the binary content into {@code buffer} starting from 
{@code cursor}, as
+   * described in the class comment. The caller should guarantee there is 
enough space in `buffer`.
+   */
+  public void writeIntoUnsafeRow(byte[] buffer, long cursor) {

Review Comment:
   interesting, `UnsafeWriter` always use `byte[]`, then it's fine.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45831][CORE][SQL][DSTREAM] Use collection factory instead to create immutable Java collections [spark]

2023-11-08 Thread via GitHub



LuciferYang commented on PR #43709:
URL: https://github.com/apache/spark/pull/43709#issuecomment-1803284125

   Thanks @beliefer ~


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45827] Add Variant data type in Spark. [spark]

2023-11-08 Thread via GitHub



cloud-fan commented on PR #43707:
URL: https://github.com/apache/spark/pull/43707#issuecomment-1803280319

   We should also check all the call sites of 
`DataSource#disallowWritingIntervals` and also disallow writing variants to DS 
v1.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45827] Add Variant data type in Spark. [spark]

2023-11-08 Thread via GitHub



cloud-fan commented on PR #43707:
URL: https://github.com/apache/spark/pull/43707#issuecomment-1803277101

   Can we check `FileFormat#supportDataType` and make sure only Parquet 
supports it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45562][SQL] XML: Add SQL error class for missing rowTag option [spark]

2023-11-08 Thread via GitHub



beliefer commented on code in PR #43710:
URL: https://github.com/apache/spark/pull/43710#discussion_r1387573558


##
common/utils/src/main/resources/error/error-classes.json:
##
@@ -3899,6 +3899,12 @@
 },
 "sqlState" : "42605"
   },
+  "XML_ROW_TAG_OPTION_REQUIRED" : {

Review Comment:
   How about `XML_ROW_TAG_MISSING`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45830][CORE] Refactor `StorageUtils#bufferCleaner` to avoid directly using classes under the `sun` package [spark]

2023-11-08 Thread via GitHub



LuciferYang commented on code in PR #43675:
URL: https://github.com/apache/spark/pull/43675#discussion_r1387574129


##
core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:
##
@@ -197,13 +195,18 @@ private[spark] class StorageStatus(
 /** Helper methods for storage-related objects. */
 private[spark] object StorageUtils extends Logging {
 
-  private val bufferCleaner: DirectBuffer => Unit = {
-val cleanerMethod =
-  Utils.classForName("sun.misc.Unsafe").getMethod("invokeCleaner", 
classOf[ByteBuffer])
-val unsafeField = classOf[Unsafe].getDeclaredField("theUnsafe")
-unsafeField.setAccessible(true)
-val unsafe = unsafeField.get(null).asInstanceOf[Unsafe]
-buffer: DirectBuffer => cleanerMethod.invoke(unsafe, buffer)
+  private val bufferCleaner: ByteBuffer => Unit = {
+val cleanerClass = Utils.classForName("jdk.internal.ref.Cleaner")
+val directBufferClass = Utils.classForName("sun.nio.ch.DirectBuffer")
+val byteBufferLookup: MethodHandles.Lookup =
+  MethodHandles.privateLookupIn(directBufferClass, MethodHandles.lookup())
+val cleanerMethod: MethodHandle = byteBufferLookup
+  .findVirtual(directBufferClass, "cleaner", 
MethodType.methodType(cleanerClass))
+val cleanerLookup: MethodHandles.Lookup =
+  MethodHandles.privateLookupIn(cleanerClass, MethodHandles.lookup())
+val cleanMethod: MethodHandle =
+  cleanerLookup.findVirtual(cleanerClass, "clean", 
MethodType.methodType(classOf[Unit]))
+buffer: ByteBuffer => cleanMethod.invoke(cleanerMethod.invoke(buffer))

Review Comment:
   The new method is indeed slower than the old method because there are two 
calls to the method handle. Let me think about what to do. @dongjoon-hyun 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45562][SQL] XML: Add SQL error class for missing rowTag option [spark]

2023-11-08 Thread via GitHub



beliefer commented on code in PR #43710:
URL: https://github.com/apache/spark/pull/43710#discussion_r1387571783


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlOptions.scala:
##
@@ -66,7 +66,11 @@ private[sql] class XmlOptions(
 
   val compressionCodec = 
parameters.get(COMPRESSION).map(CompressionCodecs.getCodecClassName)
   val rowTagOpt = parameters.get(XmlOptions.ROW_TAG).map(_.trim)
-  require(!rowTagRequired || rowTagOpt.isDefined, s"'${XmlOptions.ROW_TAG}' 
option is required.")
+
+  if (rowTagRequired && rowTagOpt.isEmpty) {
+throw QueryCompilationErrors.xmlRowTagRequiredError(XmlOptions.ROW_TAG)

Review Comment:
   I got it now. I'm missing the `rowTagRequired`. The `rowTagRequired` is 
false if the call the `to_xml`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45827] Add Variant data type in Spark. [spark]

2023-11-08 Thread via GitHub



cloud-fan commented on code in PR #43707:
URL: https://github.com/apache/spark/pull/43707#discussion_r1387567758


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala:
##
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.variant
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types._
+
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(jsonStr) - Parse a JSON string as an Variant value. Throw an 
exception when the string is not valid JSON value.",
+  examples =
+"""
+Examples:
+  """,
+  since = "3.4.0",

Review Comment:
   `4.0.0`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45827] Add Variant data type in Spark. [spark]

2023-11-08 Thread via GitHub



chenhao-db commented on code in PR #43707:
URL: https://github.com/apache/spark/pull/43707#discussion_r1387567192


##
common/unsafe/src/main/java/org/apache/spark/unsafe/types/VariantVal.java:
##
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.unsafe.types;
+
+import org.apache.spark.unsafe.Platform;
+
+import java.io.Serializable;
+import java.util.Arrays;
+
+/**
+ * The physical data representation of {@link 
org.apache.spark.sql.types.VariantType} that
+ * represents a semi-structured value. It consists of two binary values: 
{@link VariantVal#value}
+ * and {@link VariantVal#metadata}. The value encodes types and values, but 
not field names. The
+ * metadata currently contains a version flag and a list of field names. We 
can extend/modify the
+ * detailed binary format given the version flag.
+ * 
+ * A {@link VariantVal} can be produced by casting another value into the 
Variant type or parsing a
+ * JSON string in the {@link 
org.apache.spark.sql.catalyst.expressions.variant.ParseJson}
+ * expression. We can extract a path consisting of field names and array 
indices from it, cast it
+ * into a concrete data type, or rebuild a JSON string from it.
+ * 
+ * The storage layout of this class in {@link 
org.apache.spark.sql.catalyst.expressions.UnsafeRow}
+ * and {@link org.apache.spark.sql.catalyst.expressions.UnsafeArrayData} is: 
the fixed-size part is
+ * a long value "offsetAndSize". The upper 32 bits is the offset that points 
to the start position
+ * of the actual binary content. The lower 32 bits is the total length of the 
binary content. The
+ * binary content contains: 4 bytes representing the length of {@link 
VariantVal#value}, content of
+ * {@link VariantVal#value}, content of {@link VariantVal#metadata}. This is 
an internal and
+ * transient format and can be modified at any time.
+ */
+public class VariantVal implements Serializable {
+  protected final byte[] value;
+  protected final byte[] metadata;
+
+  public VariantVal(byte[] value, byte[] metadata) {
+this.value = value;
+this.metadata = metadata;
+  }
+
+  public byte[] getValue() {
+return value;
+  }
+
+  public byte[] getMetadata() {
+return metadata;
+  }
+
+  /**
+   * This function writes the binary content into {@code buffer} starting from 
{@code cursor}, as
+   * described in the class comment. The caller should guarantee there is 
enough space in `buffer`.
+   */
+  public void writeIntoUnsafeRow(byte[] buffer, long cursor) {
+Platform.putInt(buffer, cursor, value.length);
+Platform.copyMemory(value, Platform.BYTE_ARRAY_OFFSET, buffer, cursor + 4, 
value.length);
+Platform.copyMemory(
+metadata,
+Platform.BYTE_ARRAY_OFFSET,
+buffer,
+cursor + 4 + value.length,
+metadata.length
+);
+  }
+
+  /**
+   * This function reads the binary content described in `writeIntoUnsafeRow` 
from `baseObject`. The
+   * offset is computed by adding the offset in {@code offsetAndSize} and 
{@code baseOffset}.
+   */
+  public static VariantVal readFromUnsafeRow(long offsetAndSize, Object 
baseObject,
+ long baseOffset) {
+// offset and totalSize is the upper/lower 32 bits in offsetAndSize.
+int offset = (int) (offsetAndSize >> 32);
+int totalSize = (int) offsetAndSize;
+int valueSize = Platform.getInt(baseObject, baseOffset + offset);
+int metadataSize = totalSize - 4 - valueSize;
+byte[] value = new byte[valueSize];
+byte[] metadata = new byte[metadataSize];
+Platform.copyMemory(

Review Comment:
   I know how `UTF8String` works, but I feel it is simpler to have `byte[]` in 
the `VariantVal` object instead of `baseObject + baseOffset`. I prefer to have 
this version first and it is not something unchangeable in the future.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] [SPARK-45562][SQL] XML: Add SQL error class for missing rowTag option [spark]

2023-11-08 Thread via GitHub



sandip-db commented on code in PR #43710:
URL: https://github.com/apache/spark/pull/43710#discussion_r1387566546


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlOptions.scala:
##
@@ -66,7 +66,11 @@ private[sql] class XmlOptions(
 
   val compressionCodec = 
parameters.get(COMPRESSION).map(CompressionCodecs.getCodecClassName)
   val rowTagOpt = parameters.get(XmlOptions.ROW_TAG).map(_.trim)
-  require(!rowTagRequired || rowTagOpt.isDefined, s"'${XmlOptions.ROW_TAG}' 
option is required.")
+
+  if (rowTagRequired && rowTagOpt.isEmpty) {
+throw QueryCompilationErrors.xmlRowTagRequiredError(XmlOptions.ROW_TAG)

Review Comment:
   Currently, `to_xml` doesn't throw an error if `rowTag` is not specified 
because it uses the default. If we do the change that you suggested above, even 
`to_xml` will require `rowTag` and throw an error if one is not provided. This 
PR is just standardizing the error that gets thrown when rowTag is not provided 
on "read". It doesn't change any other behavior.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45827] Add Variant data type in Spark. [spark]

2023-11-08 Thread via GitHub



cloud-fan commented on code in PR #43707:
URL: https://github.com/apache/spark/pull/43707#discussion_r1387563169


##
common/unsafe/src/main/java/org/apache/spark/unsafe/types/VariantVal.java:
##
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.unsafe.types;
+
+import org.apache.spark.unsafe.Platform;
+
+import java.io.Serializable;
+import java.util.Arrays;
+
+/**
+ * The physical data representation of {@link 
org.apache.spark.sql.types.VariantType} that
+ * represents a semi-structured value. It consists of two binary values: 
{@link VariantVal#value}
+ * and {@link VariantVal#metadata}. The value encodes types and values, but 
not field names. The
+ * metadata currently contains a version flag and a list of field names. We 
can extend/modify the
+ * detailed binary format given the version flag.
+ * 
+ * A {@link VariantVal} can be produced by casting another value into the 
Variant type or parsing a
+ * JSON string in the {@link 
org.apache.spark.sql.catalyst.expressions.variant.ParseJson}
+ * expression. We can extract a path consisting of field names and array 
indices from it, cast it
+ * into a concrete data type, or rebuild a JSON string from it.
+ * 
+ * The storage layout of this class in {@link 
org.apache.spark.sql.catalyst.expressions.UnsafeRow}
+ * and {@link org.apache.spark.sql.catalyst.expressions.UnsafeArrayData} is: 
the fixed-size part is
+ * a long value "offsetAndSize". The upper 32 bits is the offset that points 
to the start position
+ * of the actual binary content. The lower 32 bits is the total length of the 
binary content. The
+ * binary content contains: 4 bytes representing the length of {@link 
VariantVal#value}, content of
+ * {@link VariantVal#value}, content of {@link VariantVal#metadata}. This is 
an internal and
+ * transient format and can be modified at any time.
+ */
+public class VariantVal implements Serializable {
+  protected final byte[] value;
+  protected final byte[] metadata;
+
+  public VariantVal(byte[] value, byte[] metadata) {
+this.value = value;
+this.metadata = metadata;
+  }
+
+  public byte[] getValue() {
+return value;
+  }
+
+  public byte[] getMetadata() {
+return metadata;
+  }
+
+  /**
+   * This function writes the binary content into {@code buffer} starting from 
{@code cursor}, as
+   * described in the class comment. The caller should guarantee there is 
enough space in `buffer`.
+   */
+  public void writeIntoUnsafeRow(byte[] buffer, long cursor) {
+Platform.putInt(buffer, cursor, value.length);
+Platform.copyMemory(value, Platform.BYTE_ARRAY_OFFSET, buffer, cursor + 4, 
value.length);
+Platform.copyMemory(
+metadata,
+Platform.BYTE_ARRAY_OFFSET,
+buffer,
+cursor + 4 + value.length,
+metadata.length
+);
+  }
+
+  /**
+   * This function reads the binary content described in `writeIntoUnsafeRow` 
from `baseObject`. The
+   * offset is computed by adding the offset in {@code offsetAndSize} and 
{@code baseOffset}.
+   */
+  public static VariantVal readFromUnsafeRow(long offsetAndSize, Object 
baseObject,
+ long baseOffset) {
+// offset and totalSize is the upper/lower 32 bits in offsetAndSize.
+int offset = (int) (offsetAndSize >> 32);
+int totalSize = (int) offsetAndSize;
+int valueSize = Platform.getInt(baseObject, baseOffset + offset);
+int metadataSize = totalSize - 4 - valueSize;
+byte[] value = new byte[valueSize];
+byte[] metadata = new byte[metadataSize];
+Platform.copyMemory(

Review Comment:
   We can avoid copy if `VariantVal` can follow `UTF8String` and also represent 
its data as `baseObject + baseOffset`. Does Java have nicer way to do it now? 
cc @rednaxelafx  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe,

Re: [PR] [SPARK-45827] Add Variant data type in Spark. [spark]

2023-11-08 Thread via GitHub



chenhao-db commented on code in PR #43707:
URL: https://github.com/apache/spark/pull/43707#discussion_r1387562510


##
common/unsafe/src/main/java/org/apache/spark/unsafe/types/VariantVal.java:
##
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.unsafe.types;
+
+import org.apache.spark.unsafe.Platform;
+
+import java.io.Serializable;
+import java.util.Arrays;
+
+/**
+ * The physical data representation of {@link 
org.apache.spark.sql.types.VariantType} that
+ * represents a semi-structured value. It consists of two binary values: 
{@link VariantVal#value}
+ * and {@link VariantVal#metadata}. The value encodes types and values, but 
not field names. The
+ * metadata currently contains a version flag and a list of field names. We 
can extend/modify the
+ * detailed binary format given the version flag.
+ * 
+ * A {@link VariantVal} can be produced by casting another value into the 
Variant type or parsing a
+ * JSON string in the {@link 
org.apache.spark.sql.catalyst.expressions.variant.ParseJson}
+ * expression. We can extract a path consisting of field names and array 
indices from it, cast it
+ * into a concrete data type, or rebuild a JSON string from it.
+ * 
+ * The storage layout of this class in {@link 
org.apache.spark.sql.catalyst.expressions.UnsafeRow}
+ * and {@link org.apache.spark.sql.catalyst.expressions.UnsafeArrayData} is: 
the fixed-size part is
+ * a long value "offsetAndSize". The upper 32 bits is the offset that points 
to the start position
+ * of the actual binary content. The lower 32 bits is the total length of the 
binary content. The
+ * binary content contains: 4 bytes representing the length of {@link 
VariantVal#value}, content of
+ * {@link VariantVal#value}, content of {@link VariantVal#metadata}. This is 
an internal and
+ * transient format and can be modified at any time.
+ */
+public class VariantVal implements Serializable {
+  protected final byte[] value;
+  protected final byte[] metadata;
+
+  public VariantVal(byte[] value, byte[] metadata) {
+this.value = value;
+this.metadata = metadata;
+  }
+
+  public byte[] getValue() {
+return value;
+  }
+
+  public byte[] getMetadata() {
+return metadata;
+  }
+
+  /**
+   * This function writes the binary content into {@code buffer} starting from 
{@code cursor}, as
+   * described in the class comment. The caller should guarantee there is 
enough space in `buffer`.
+   */
+  public void writeIntoUnsafeRow(byte[] buffer, long cursor) {

Review Comment:
   I'm not quite sure what you mean. This function is called by `UnsafeWriter`, 
which always uses `byte[]` to build an `UnsafeRow`. There is no benefit of 
changing this function to use a generic base object.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45825][SQL][CORE][GRAPHX] Fix some scala compile warnings in module sql/catalyst [spark]

2023-11-08 Thread via GitHub



LuciferYang commented on PR #43717:
URL: https://github.com/apache/spark/pull/43717#issuecomment-1803258795

   > > Otherwise, I'm worried that these problems will arise after some time.
   > 
   > Good question. But I feel the `scalafmt` can't resolve these issue. 
@LuciferYang Do you know how to add the check rules of IntelliJ to the Spark 
builder?
   
   These are two inspection systems, at least now I don't know how to do it We 
need some additional investigation
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45562][SQL] XML: Add SQL error class for missing rowTag option [spark]

2023-11-08 Thread via GitHub



beliefer commented on code in PR #43710:
URL: https://github.com/apache/spark/pull/43710#discussion_r1387559815


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlOptions.scala:
##
@@ -66,7 +66,11 @@ private[sql] class XmlOptions(
 
   val compressionCodec = 
parameters.get(COMPRESSION).map(CompressionCodecs.getCodecClassName)
   val rowTagOpt = parameters.get(XmlOptions.ROW_TAG).map(_.trim)
-  require(!rowTagRequired || rowTagOpt.isDefined, s"'${XmlOptions.ROW_TAG}' 
option is required.")
+
+  if (rowTagRequired && rowTagOpt.isEmpty) {
+throw QueryCompilationErrors.xmlRowTagRequiredError(XmlOptions.ROW_TAG)

Review Comment:
   So, the error appears if users call `to_xml` and do not specify the rowTag 
option?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45827] Add Variant data type in Spark. [spark]

2023-11-08 Thread via GitHub



cloud-fan commented on code in PR #43707:
URL: https://github.com/apache/spark/pull/43707#discussion_r1387559447


##
common/unsafe/src/main/java/org/apache/spark/unsafe/types/VariantVal.java:
##
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.unsafe.types;
+
+import org.apache.spark.unsafe.Platform;
+
+import java.io.Serializable;
+import java.util.Arrays;
+
+/**
+ * The physical data representation of {@link 
org.apache.spark.sql.types.VariantType} that
+ * represents a semi-structured value. It consists of two binary values: 
{@link VariantVal#value}
+ * and {@link VariantVal#metadata}. The value encodes types and values, but 
not field names. The
+ * metadata currently contains a version flag and a list of field names. We 
can extend/modify the
+ * detailed binary format given the version flag.
+ * 
+ * A {@link VariantVal} can be produced by casting another value into the 
Variant type or parsing a
+ * JSON string in the {@link 
org.apache.spark.sql.catalyst.expressions.variant.ParseJson}
+ * expression. We can extract a path consisting of field names and array 
indices from it, cast it
+ * into a concrete data type, or rebuild a JSON string from it.
+ * 
+ * The storage layout of this class in {@link 
org.apache.spark.sql.catalyst.expressions.UnsafeRow}
+ * and {@link org.apache.spark.sql.catalyst.expressions.UnsafeArrayData} is: 
the fixed-size part is
+ * a long value "offsetAndSize". The upper 32 bits is the offset that points 
to the start position
+ * of the actual binary content. The lower 32 bits is the total length of the 
binary content. The
+ * binary content contains: 4 bytes representing the length of {@link 
VariantVal#value}, content of
+ * {@link VariantVal#value}, content of {@link VariantVal#metadata}. This is 
an internal and
+ * transient format and can be modified at any time.
+ */
+public class VariantVal implements Serializable {
+  protected final byte[] value;
+  protected final byte[] metadata;
+
+  public VariantVal(byte[] value, byte[] metadata) {
+this.value = value;
+this.metadata = metadata;
+  }
+
+  public byte[] getValue() {
+return value;
+  }
+
+  public byte[] getMetadata() {
+return metadata;
+  }
+
+  /**
+   * This function writes the binary content into {@code buffer} starting from 
{@code cursor}, as
+   * described in the class comment. The caller should guarantee there is 
enough space in `buffer`.
+   */
+  public void writeIntoUnsafeRow(byte[] buffer, long cursor) {

Review Comment:
   there is no `byte[]` if Spark is using offheap mode.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45827] Add Variant data type in Spark. [spark]

2023-11-08 Thread via GitHub



cloud-fan commented on code in PR #43707:
URL: https://github.com/apache/spark/pull/43707#discussion_r1387559229


##
common/unsafe/src/main/java/org/apache/spark/unsafe/types/VariantVal.java:
##
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.unsafe.types;
+
+import org.apache.spark.unsafe.Platform;
+
+import java.io.Serializable;
+import java.util.Arrays;
+
+/**
+ * The physical data representation of {@link 
org.apache.spark.sql.types.VariantType} that
+ * represents a semi-structured value. It consists of two binary values: 
{@link VariantVal#value}
+ * and {@link VariantVal#metadata}. The value encodes types and values, but 
not field names. The
+ * metadata currently contains a version flag and a list of field names. We 
can extend/modify the
+ * detailed binary format given the version flag.
+ * 
+ * A {@link VariantVal} can be produced by casting another value into the 
Variant type or parsing a
+ * JSON string in the {@link 
org.apache.spark.sql.catalyst.expressions.variant.ParseJson}
+ * expression. We can extract a path consisting of field names and array 
indices from it, cast it
+ * into a concrete data type, or rebuild a JSON string from it.
+ * 
+ * The storage layout of this class in {@link 
org.apache.spark.sql.catalyst.expressions.UnsafeRow}
+ * and {@link org.apache.spark.sql.catalyst.expressions.UnsafeArrayData} is: 
the fixed-size part is
+ * a long value "offsetAndSize". The upper 32 bits is the offset that points 
to the start position
+ * of the actual binary content. The lower 32 bits is the total length of the 
binary content. The
+ * binary content contains: 4 bytes representing the length of {@link 
VariantVal#value}, content of
+ * {@link VariantVal#value}, content of {@link VariantVal#metadata}. This is 
an internal and
+ * transient format and can be modified at any time.
+ */
+public class VariantVal implements Serializable {
+  protected final byte[] value;
+  protected final byte[] metadata;
+
+  public VariantVal(byte[] value, byte[] metadata) {
+this.value = value;
+this.metadata = metadata;
+  }
+
+  public byte[] getValue() {
+return value;
+  }
+
+  public byte[] getMetadata() {
+return metadata;
+  }
+
+  /**
+   * This function writes the binary content into {@code buffer} starting from 
{@code cursor}, as
+   * described in the class comment. The caller should guarantee there is 
enough space in `buffer`.
+   */
+  public void writeIntoUnsafeRow(byte[] buffer, long cursor) {

Review Comment:
   ```suggestion
 public void writeIntoUnsafeRow(Object baseObject, long baseOffset, long 
cursor) {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]

2023-11-08 Thread via GitHub



beliefer commented on code in PR #43629:
URL: https://github.com/apache/spark/pull/43629#discussion_r1387554850


##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala:
##
@@ -101,56 +98,19 @@ case class AnalyzePartitionCommand(
   if (noscan) {
 Map.empty
   } else {
-calculateRowCountsPerPartition(sparkSession, tableMeta, 
partitionValueSpec)
+CommandUtils.calculateRowCountsPerPartition(sparkSession, tableMeta, 
partitionValueSpec)
   }
 
 // Update the metastore if newly computed statistics are different from 
those
 // recorded in the metastore.
-
-val sizes = CommandUtils.calculateMultipleLocationSizes(sparkSession, 
tableMeta.identifier,
-  partitions.map(_.storage.locationUri))
-val newPartitions = partitions.zipWithIndex.flatMap { case (p, idx) =>
-  val newRowCount = rowCounts.get(p.spec)
-  val newStats = CommandUtils.compareAndGetNewStats(p.stats, sizes(idx), 
newRowCount)
-  newStats.map(_ => p.copy(stats = newStats))
-}
-
+val (_, newPartitions) = CommandUtils.calculatePartitionStats(
+  sparkSession, tableMeta, partitions, Some(rowCounts))
 if (newPartitions.nonEmpty) {
   sessionState.catalog.alterPartitions(tableMeta.identifier, newPartitions)
 }
 
 Seq.empty[Row]
   }
 
-  private def calculateRowCountsPerPartition(

Review Comment:
   Is calculateRowCountsPerPartition used only one place?



##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala:
##
@@ -101,56 +98,19 @@ case class AnalyzePartitionCommand(
   if (noscan) {
 Map.empty
   } else {
-calculateRowCountsPerPartition(sparkSession, tableMeta, 
partitionValueSpec)
+CommandUtils.calculateRowCountsPerPartition(sparkSession, tableMeta, 
partitionValueSpec)
   }
 
 // Update the metastore if newly computed statistics are different from 
those
 // recorded in the metastore.
-
-val sizes = CommandUtils.calculateMultipleLocationSizes(sparkSession, 
tableMeta.identifier,
-  partitions.map(_.storage.locationUri))
-val newPartitions = partitions.zipWithIndex.flatMap { case (p, idx) =>
-  val newRowCount = rowCounts.get(p.spec)
-  val newStats = CommandUtils.compareAndGetNewStats(p.stats, sizes(idx), 
newRowCount)
-  newStats.map(_ => p.copy(stats = newStats))
-}
-
+val (_, newPartitions) = CommandUtils.calculatePartitionStats(
+  sparkSession, tableMeta, partitions, Some(rowCounts))
 if (newPartitions.nonEmpty) {
   sessionState.catalog.alterPartitions(tableMeta.identifier, newPartitions)
 }
 
 Seq.empty[Row]
   }
 
-  private def calculateRowCountsPerPartition(

Review Comment:
   Is `calculateRowCountsPerPartition` used only one place?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45825][SQL][CORE][GRAPHX] Fix some scala compile warnings in module sql/catalyst [spark]

2023-11-08 Thread via GitHub



beliefer commented on PR #43717:
URL: https://github.com/apache/spark/pull/43717#issuecomment-1803250355

   > Otherwise, I'm worried that these problems will arise after some time.
   Good question. But I feel the `scalafmt` can't resolve these issue.
   @LuciferYang Do you know how to add the check rules of IntelliJ to the Spark 
builder?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45655][SQL][SS] Allow non-deterministic expressions inside AggregateFunctions in CollectMetrics [spark]

2023-11-08 Thread via GitHub



beliefer commented on PR #43517:
URL: https://github.com/apache/spark/pull/43517#issuecomment-1803246795

   Please check `SPARK-45655: Use current batch timestamp in observe API *** 
FAILED *** (241 milliseconds)`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45848] Make `spark-version-info.properties` generated by `spark-build-info.ps1` include `docroot` [spark]

2023-11-08 Thread via GitHub



LuciferYang commented on PR #43726:
URL: https://github.com/apache/spark/pull/43726#issuecomment-1803235338

   cc @wangyum FYI
   
   Step to check with ps on mac:
   1. install ps: https://github.com/powershell/powershell#get-powershell
   2. run `export USERNAME=${userName}`
   3. run `/usr/local/bin/pwsh -File build/spark-build-info.ps1 ${basedir} 
${spark-version}`
   4. run `cat ${basedir}/spark-version-info.properties` to check result


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45825][SQL][CORE][GRAPHX] Fix some scala compile warnings in module sql/catalyst [spark]

2023-11-08 Thread via GitHub



panbingkun commented on PR #43717:
URL: https://github.com/apache/spark/pull/43717#issuecomment-1803227663

   Do we have any `tools` that can help prohibit the `next time`? Eg: scalafmt ?
   Otherwise, I'm worried that these problems will arise after some time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-45848] Make `spark-version-info.properties` generated by `spark-build-info.ps1` include `docroot` [spark]

2023-11-08 Thread via GitHub



LuciferYang opened a new pull request, #43726:
URL: https://github.com/apache/spark/pull/43726

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]

2023-11-08 Thread via GitHub



sunchao commented on code in PR #43629:
URL: https://github.com/apache/spark/pull/43629#discussion_r1387527377


##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala:
##
@@ -101,56 +98,19 @@ case class AnalyzePartitionCommand(
   if (noscan) {
 Map.empty
   } else {
-calculateRowCountsPerPartition(sparkSession, tableMeta, 
partitionValueSpec)
+CommandUtils.calculateRowCountsPerPartition(sparkSession, tableMeta, 
partitionValueSpec)
   }
 
 // Update the metastore if newly computed statistics are different from 
those
 // recorded in the metastore.
-
-val sizes = CommandUtils.calculateMultipleLocationSizes(sparkSession, 
tableMeta.identifier,
-  partitions.map(_.storage.locationUri))
-val newPartitions = partitions.zipWithIndex.flatMap { case (p, idx) =>
-  val newRowCount = rowCounts.get(p.spec)
-  val newStats = CommandUtils.compareAndGetNewStats(p.stats, sizes(idx), 
newRowCount)
-  newStats.map(_ => p.copy(stats = newStats))
-}
-
+val (_, newPartitions) = CommandUtils.calculatePartitionStats(
+  sparkSession, tableMeta, partitions, Some(rowCounts))
 if (newPartitions.nonEmpty) {
   sessionState.catalog.alterPartitions(tableMeta.identifier, newPartitions)
 }
 
 Seq.empty[Row]
   }
 
-  private def calculateRowCountsPerPartition(

Review Comment:
   Hmm what do you mean? it is used multiple times.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-44307][SQL] Add Bloom filter for left outer join even if the left side table is smaller than broadcast threshold. [spark]

2023-11-08 Thread via GitHub



beliefer commented on PR #41860:
URL: https://github.com/apache/spark/pull/41860#issuecomment-1803217192

   @maheshk114 Could you rebase this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]

2023-11-08 Thread via GitHub



beliefer commented on code in PR #43629:
URL: https://github.com/apache/spark/pull/43629#discussion_r1387524303


##
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala:
##
@@ -363,6 +363,83 @@ class StatisticsSuite extends StatisticsCollectionTestBase 
with TestHiveSingleto
 }
   }
 
+  test("SPARK-45731: update partition stats with ANALYZE TABLE") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String): Option[CatalogStatistics] = {
+  val partition =
+spark.sessionState.catalog.getPartition(TableIdentifier(tableName), 
Map("ds" -> ds))
+  partition.stats
+}
+
+val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03")
+
+Seq(true, false).foreach { partitionStatsEnabled =>
+  withSQLConf(SQLConf.UPDATE_PART_STATS_IN_ANALYZE_TABLE_ENABLED.key ->
+partitionStatsEnabled.toString) {
+withTable(tableName) {
+  withTempPath { path =>
+// Create a table with 3 partitions all located under a directory 
'path'
+sql(
+  s"""
+ |CREATE TABLE $tableName (key STRING, value STRING)
+ |USING hive
+ |PARTITIONED BY (ds STRING)
+ |LOCATION '${path.toURI}'
+   """.stripMargin)
+
+partitionDates.foreach { ds =>
+  sql(s"ALTER TABLE $tableName ADD PARTITION (ds='$ds') LOCATION 
'$path/ds=$ds'")
+  sql("SELECT * FROM src").write.mode(SaveMode.Overwrite)
+  .format("parquet").save(s"$path/ds=$ds")
+}
+
+assert(getCatalogTable(tableName).stats.isEmpty)
+partitionDates.foreach { ds =>
+  assert(queryStats(ds).isEmpty)
+}
+
+sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS NOSCAN")
+
+// Table size should also have been updated
+assert(getTableStats(tableName).sizeInBytes > 0)
+// Row count should NOT be updated with the `NOSCAN` option
+assert(getTableStats(tableName).rowCount.isEmpty)
+
+partitionDates.foreach { ds =>
+  val partStats = queryStats(ds)
+  if (partitionStatsEnabled) {
+assert(partStats.nonEmpty)
+assert(partStats.get.sizeInBytes > 0)
+assert(partStats.get.rowCount.isEmpty)
+  } else {
+assert(partStats.isEmpty)
+  }
+}
+
+sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS")
+
+assert(getTableStats(tableName).sizeInBytes > 0)
+// Table row count should be updated
+assert(getTableStats(tableName).rowCount.get > 0)

Review Comment:
   I think it doesn't matter for this PR. But I still want to know what the 
root cause is?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]

2023-11-08 Thread via GitHub



beliefer commented on code in PR #43629:
URL: https://github.com/apache/spark/pull/43629#discussion_r1387523005


##
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala:
##
@@ -363,6 +363,85 @@ class StatisticsSuite extends StatisticsCollectionTestBase 
with TestHiveSingleto
 }
   }
 
+  test("SPARK-45731: update partition stats with ANALYZE TABLE") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String): Option[CatalogStatistics] = {
+  val partition =
+spark.sessionState.catalog.getPartition(TableIdentifier(tableName), 
Map("ds" -> ds))
+  partition.stats
+}
+
+Seq(true, false).foreach { partitionStatsEnabled =>
+  withSQLConf(SQLConf.UPDATE_PART_STATS_IN_ANALYZE_TABLE_ENABLED.key ->
+  partitionStatsEnabled.toString) {
+withTable(tableName) {
+  withTempPath { path =>
+// Create a table with 3 partitions all located under a directory 
'path'
+sql(
+  s"""
+ |CREATE TABLE $tableName (key STRING, value STRING)
+ |USING hive
+ |PARTITIONED BY (ds STRING)
+ |LOCATION '${path.toURI}'
+  """.stripMargin)
+
+val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03")
+
+partitionDates.foreach { ds =>
+  sql(s"ALTER TABLE $tableName ADD PARTITION (ds='$ds') LOCATION 
'$path/ds=$ds'")
+  sql("SELECT * FROM src").write.mode(SaveMode.Overwrite)
+  .format("parquet").save(s"$path/ds=$ds")
+}
+
+assert(getCatalogTable(tableName).stats.isEmpty)
+partitionDates.foreach { ds =>
+  assert(queryStats(ds).isEmpty)
+}
+
+sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS NOSCAN")
+
+val expectedRowCount = 25
+
+// Table size should also have been updated
+assert(getTableStats(tableName).sizeInBytes > 0)
+// Row count should NOT be updated with the `NOSCAN` option
+assert(getTableStats(tableName).rowCount.isEmpty)
+
+partitionDates.foreach { ds =>
+  val partStats = queryStats(ds)
+  if (partitionStatsEnabled) {
+assert(partStats.nonEmpty)
+assert(partStats.get.sizeInBytes > 0)
+assert(partStats.get.rowCount.isEmpty)
+  } else {
+assert(partStats.isEmpty)
+  }
+}

Review Comment:
   SGTM.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]

2023-11-08 Thread via GitHub



beliefer commented on code in PR #43629:
URL: https://github.com/apache/spark/pull/43629#discussion_r1387522474


##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala:
##
@@ -101,56 +98,19 @@ case class AnalyzePartitionCommand(
   if (noscan) {
 Map.empty
   } else {
-calculateRowCountsPerPartition(sparkSession, tableMeta, 
partitionValueSpec)
+CommandUtils.calculateRowCountsPerPartition(sparkSession, tableMeta, 
partitionValueSpec)
   }
 
 // Update the metastore if newly computed statistics are different from 
those
 // recorded in the metastore.
-
-val sizes = CommandUtils.calculateMultipleLocationSizes(sparkSession, 
tableMeta.identifier,
-  partitions.map(_.storage.locationUri))
-val newPartitions = partitions.zipWithIndex.flatMap { case (p, idx) =>
-  val newRowCount = rowCounts.get(p.spec)
-  val newStats = CommandUtils.compareAndGetNewStats(p.stats, sizes(idx), 
newRowCount)
-  newStats.map(_ => p.copy(stats = newStats))
-}
-
+val (_, newPartitions) = CommandUtils.calculatePartitionStats(
+  sparkSession, tableMeta, partitions, Some(rowCounts))
 if (newPartitions.nonEmpty) {
   sessionState.catalog.alterPartitions(tableMeta.identifier, newPartitions)
 }
 
 Seq.empty[Row]
   }
 
-  private def calculateRowCountsPerPartition(

Review Comment:
   I think we shouldn't move `calculateRowCountsPerPartition` if it is used 
only once.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45825][SQL][CORE][GRAPHX] Fix some scala compile warnings in module sql/catalyst [spark]

2023-11-08 Thread via GitHub



beliefer commented on code in PR #43717:
URL: https://github.com/apache/spark/pull/43717#discussion_r1387462013


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala:
##
@@ -197,7 +197,7 @@ case class Mask(
*/
   override def eval(input: InternalRow): Any = {
 Mask.transformInput(
-  children(0).eval(input),
+  children.head.eval(input),

Review Comment:
   Considering the best practices of scala, personally, I feel the `.head` is 
better than `(0)`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45830][CORE] Refactor `StorageUtils#bufferCleaner` to avoid directly using classes under the `sun` package [spark]

2023-11-08 Thread via GitHub



LuciferYang commented on PR #43675:
URL: https://github.com/apache/spark/pull/43675#issuecomment-1803202799

   convert to draft first, let me check the performance
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45556][UI] Allow web page respond customized status code and message through WebApplicationException [spark]

2023-11-08 Thread via GitHub



kuwii commented on PR #43646:
URL: https://github.com/apache/spark/pull/43646#issuecomment-1803202085

   Kindly ping @HyukjinKwon . Could you please help to take a look at this PR 
if you think it is OK to do this change? Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45814][CONNECT][SQL]Make ArrowConverters.createEmptyArrowBatch call close() to avoid memory leak [spark]

2023-11-08 Thread via GitHub



LuciferYang commented on PR #43691:
URL: https://github.com/apache/spark/pull/43691#issuecomment-1803200302

   Could re-trigger the failed one? make all tasks green is safer @xieshuaihu 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]

2023-11-08 Thread via GitHub



sunchao commented on code in PR #43629:
URL: https://github.com/apache/spark/pull/43629#discussion_r1387505911


##
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala:
##
@@ -363,6 +363,83 @@ class StatisticsSuite extends StatisticsCollectionTestBase 
with TestHiveSingleto
 }
   }
 
+  test("SPARK-45731: update partition stats with ANALYZE TABLE") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String): Option[CatalogStatistics] = {
+  val partition =
+spark.sessionState.catalog.getPartition(TableIdentifier(tableName), 
Map("ds" -> ds))
+  partition.stats
+}
+
+val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03")
+
+Seq(true, false).foreach { partitionStatsEnabled =>
+  withSQLConf(SQLConf.UPDATE_PART_STATS_IN_ANALYZE_TABLE_ENABLED.key ->
+partitionStatsEnabled.toString) {
+withTable(tableName) {
+  withTempPath { path =>
+// Create a table with 3 partitions all located under a directory 
'path'
+sql(
+  s"""
+ |CREATE TABLE $tableName (key STRING, value STRING)
+ |USING hive
+ |PARTITIONED BY (ds STRING)
+ |LOCATION '${path.toURI}'
+   """.stripMargin)
+
+partitionDates.foreach { ds =>
+  sql(s"ALTER TABLE $tableName ADD PARTITION (ds='$ds') LOCATION 
'$path/ds=$ds'")
+  sql("SELECT * FROM src").write.mode(SaveMode.Overwrite)
+  .format("parquet").save(s"$path/ds=$ds")
+}
+
+assert(getCatalogTable(tableName).stats.isEmpty)
+partitionDates.foreach { ds =>
+  assert(queryStats(ds).isEmpty)
+}
+
+sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS NOSCAN")
+
+// Table size should also have been updated
+assert(getTableStats(tableName).sizeInBytes > 0)
+// Row count should NOT be updated with the `NOSCAN` option
+assert(getTableStats(tableName).rowCount.isEmpty)
+
+partitionDates.foreach { ds =>
+  val partStats = queryStats(ds)
+  if (partitionStatsEnabled) {
+assert(partStats.nonEmpty)
+assert(partStats.get.sizeInBytes > 0)
+assert(partStats.get.rowCount.isEmpty)
+  } else {
+assert(partStats.isEmpty)
+  }
+}
+
+sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS")
+
+assert(getTableStats(tableName).sizeInBytes > 0)
+// Table row count should be updated
+assert(getTableStats(tableName).rowCount.get > 0)

Review Comment:
   For some reason, the expected row count is different between my local run 
and the run in Spark CI (75 vs 69), so updated to just check if the `rowCount` 
is > 0 here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45562][SQL] XML: Add SQL error class for missing rowTag option [spark]

2023-11-08 Thread via GitHub



sandip-db commented on code in PR #43710:
URL: https://github.com/apache/spark/pull/43710#discussion_r1387500399


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlOptions.scala:
##
@@ -66,7 +66,11 @@ private[sql] class XmlOptions(
 
   val compressionCodec = 
parameters.get(COMPRESSION).map(CompressionCodecs.getCodecClassName)
   val rowTagOpt = parameters.get(XmlOptions.ROW_TAG).map(_.trim)
-  require(!rowTagRequired || rowTagOpt.isDefined, s"'${XmlOptions.ROW_TAG}' 
option is required.")
+
+  if (rowTagRequired && rowTagOpt.isEmpty) {
+throw QueryCompilationErrors.xmlRowTagRequiredError(XmlOptions.ROW_TAG)

Review Comment:
   That would make `rowTag` required for other paths like `to_xml`, where it is 
ok to use the default.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]

2023-11-08 Thread via GitHub



sunchao commented on code in PR #43629:
URL: https://github.com/apache/spark/pull/43629#discussion_r1387500112


##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala:
##
@@ -101,56 +98,19 @@ case class AnalyzePartitionCommand(
   if (noscan) {
 Map.empty
   } else {
-calculateRowCountsPerPartition(sparkSession, tableMeta, 
partitionValueSpec)
+CommandUtils.calculateRowCountsPerPartition(sparkSession, tableMeta, 
partitionValueSpec)
   }
 
 // Update the metastore if newly computed statistics are different from 
those
 // recorded in the metastore.
-
-val sizes = CommandUtils.calculateMultipleLocationSizes(sparkSession, 
tableMeta.identifier,
-  partitions.map(_.storage.locationUri))
-val newPartitions = partitions.zipWithIndex.flatMap { case (p, idx) =>
-  val newRowCount = rowCounts.get(p.spec)
-  val newStats = CommandUtils.compareAndGetNewStats(p.stats, sizes(idx), 
newRowCount)
-  newStats.map(_ => p.copy(stats = newStats))
-}
-
+val (_, newPartitions) = CommandUtils.calculatePartitionStats(
+  sparkSession, tableMeta, partitions, Some(rowCounts))
 if (newPartitions.nonEmpty) {
   sessionState.catalog.alterPartitions(tableMeta.identifier, newPartitions)
 }
 
 Seq.empty[Row]
   }
 
-  private def calculateRowCountsPerPartition(

Review Comment:
   It's now used in `CommandUtils` so moving there and switch to use qualified 
`CommandUtils.calculateRowCountsPerPartition` in this class



##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala:
##
@@ -86,19 +91,31 @@ object CommandUtils extends Logging {
   // Calculate table size as a sum of the visible partitions. See 
SPARK-21079
   val partitions = 
sessionState.catalog.listPartitions(catalogTable.identifier)
   logInfo(s"Starting to calculate sizes for ${partitions.length} 
partitions.")
-  val paths = partitions.map(_.storage.locationUri)
-  val sizes = calculateMultipleLocationSizes(spark, 
catalogTable.identifier, paths)
-  val newPartitions = partitions.zipWithIndex.flatMap { case (p, idx) =>
-val newStats = CommandUtils.compareAndGetNewStats(p.stats, sizes(idx), 
None)
-newStats.map(_ => p.copy(stats = newStats))
-  }
+  val (sizes, newPartitions) = calculatePartitionStats(spark, 
catalogTable, partitions,
+partitionRowCount)
   (sizes.sum, newPartitions)

Review Comment:
   Yea we can use `sizes.sum` and save a line here I think



##
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala:
##
@@ -363,6 +363,85 @@ class StatisticsSuite extends StatisticsCollectionTestBase 
with TestHiveSingleto
 }
   }
 
+  test("SPARK-45731: update partition stats with ANALYZE TABLE") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String): Option[CatalogStatistics] = {
+  val partition =
+spark.sessionState.catalog.getPartition(TableIdentifier(tableName), 
Map("ds" -> ds))
+  partition.stats
+}
+
+Seq(true, false).foreach { partitionStatsEnabled =>
+  withSQLConf(SQLConf.UPDATE_PART_STATS_IN_ANALYZE_TABLE_ENABLED.key ->
+  partitionStatsEnabled.toString) {
+withTable(tableName) {
+  withTempPath { path =>
+// Create a table with 3 partitions all located under a directory 
'path'
+sql(
+  s"""
+ |CREATE TABLE $tableName (key STRING, value STRING)
+ |USING hive
+ |PARTITIONED BY (ds STRING)
+ |LOCATION '${path.toURI}'
+  """.stripMargin)
+
+val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03")
+
+partitionDates.foreach { ds =>
+  sql(s"ALTER TABLE $tableName ADD PARTITION (ds='$ds') LOCATION 
'$path/ds=$ds'")
+  sql("SELECT * FROM src").write.mode(SaveMode.Overwrite)
+  .format("parquet").save(s"$path/ds=$ds")
+}
+
+assert(getCatalogTable(tableName).stats.isEmpty)
+partitionDates.foreach { ds =>
+  assert(queryStats(ds).isEmpty)
+}
+
+sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS NOSCAN")
+
+val expectedRowCount = 25
+
+// Table size should also have been updated
+assert(getTableStats(tableName).sizeInBytes > 0)
+// Row count should NOT be updated with the `NOSCAN` option
+assert(getTableStats(tableName).rowCount.isEmpty)
+
+partitionDates.foreach { ds =>
+  val partStats = queryStats(ds)
+  if (partitionStatsEnabled) {
+assert(partStats.nonEmpty)
+assert(partStats.get.sizeInBytes > 0)
+assert(partStats.get.rowCount.isEmpty)
+  } else {
+assert(partStats.isEmpty)
+

Re: [PR] [SPARK-45796][SQL] Support MODE() WITHIN GROUP (ORDER BY col) [spark]

2023-11-08 Thread via GitHub



beliefer commented on PR #43663:
URL: https://github.com/apache/spark/pull/43663#issuecomment-1803167968

   ping @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45752][SQL] Unreferenced CTE should all be checked by CheckAnalysis0 [spark]

2023-11-08 Thread via GitHub



cloud-fan commented on code in PR #43614:
URL: https://github.com/apache/spark/pull/43614#discussion_r1387469464


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##
@@ -148,15 +148,39 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
   errorClass, missingCol, orderedCandidates, a.origin)
   }
 
+  private def checkUnreferencedCTERelations(
+  cteMap: mutable.Map[Long, (CTERelationDef, Int, mutable.Map[Long, Int])],
+  visited: mutable.Map[Long, Boolean],
+  cteId: Long): Unit = {
+if (visited(cteId)) {
+  return
+}
+val (cteDef, _, refMap) = cteMap(cteId)
+refMap.foreach { case (id, _) =>
+  checkUnreferencedCTERelations(cteMap, visited, id)
+}
+checkAnalysis0(cteDef.child)
+visited(cteId) = true
+  }
+
   def checkAnalysis(plan: LogicalPlan): Unit = {
 val inlineCTE = InlineCTE(alwaysInline = true)
 val cteMap = mutable.HashMap.empty[Long, (CTERelationDef, Int, 
mutable.Map[Long, Int])]
 inlineCTE.buildCTEMap(plan, cteMap)
-cteMap.values.foreach { case (relation, refCount, _) =>
+cteMap.values.foreach { case (relation, _, _) =>
   // If a CTE relation is never used, it will disappear after inline. Here 
we explicitly check
   // analysis for it, to make sure the entire query plan is valid.
   try {
-if (refCount == 0) checkAnalysis0(relation.child)
+// If a CTE relation ref count is 0, the other CTE relations that 
reference it
+// should also be checked by checkAnalysis0. This code will also 
guarantee the leaf
+// relations that are not referenced by any others are checked first.

Review Comment:
   ```suggestion
   // relations that do not reference any others are checked first.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45825][SQL][CORE][GRAPHX] Fix some scala compile warnings in module sql/catalyst [spark]

2023-11-08 Thread via GitHub



beliefer commented on code in PR #43717:
URL: https://github.com/apache/spark/pull/43717#discussion_r1387462013


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala:
##
@@ -197,7 +197,7 @@ case class Mask(
*/
   override def eval(input: InternalRow): Any = {
 Mask.transformInput(
-  children(0).eval(input),
+  children.head.eval(input),

Review Comment:
   Considering the characteristics and best practices of scala, personally, I 
feel the `.head` is better than `(0)`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45847][SQL][TESTS] CliSuite flakiness due to non-sequential guarantee for stdout [spark]

2023-11-08 Thread via GitHub



yaooqinn commented on PR #43725:
URL: https://github.com/apache/spark/pull/43725#issuecomment-1803131732

   cc @cloud-fan @dongjoon-hyun @LuciferYang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45832][SQL] Fix `Super method + is deprecated.` [spark]

2023-11-08 Thread via GitHub



beliefer commented on PR #43713:
URL: https://github.com/apache/spark/pull/43713#issuecomment-1803122782

   I checked the  inheritance hierarchy too. I agree @LuciferYang 's 
suggestion, fix it later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-45847][SQL][TESTS] CliSuite flakiness due to non-sequential guarantee for stdout [spark]

2023-11-08 Thread via GitHub



yaooqinn opened a new pull request, #43725:
URL: https://github.com/apache/spark/pull/43725

   
   
   
   ### What changes were proposed in this pull request?
   
   
   In CliSuite, This PR adds a retry for tests that write errors to STDERR. 
   
   ### Why are the changes needed?
   
   
   To fix flakiness tests as below
   
https://github.com/chenhao-db/apache-spark/actions/runs/6791437199/job/18463313766
   
https://github.com/dongjoon-hyun/spark/actions/runs/6753670527/job/18361206900
   
   ```sql
   [info]   Spark master: local, Application Id: local-1699402393189
   [info]   spark-sql> /* SELECT /*+ HINT() 4; */;
   [info]   
   [info]   [PARSE_SYNTAX_ERROR] Syntax error at or near ';'. SQLSTATE: 42601 
(line 1, pos 26)
   [info]   
   [info]   == SQL ==
   [info]   /* SELECT /*+ HINT() 4; */;
   [info]   --^^^
   [info]   
   [info]   spark-sql> /* SELECT /*+ HINT() 4; */ SELECT 1;
   [info]   1
   [info]   Time taken: 1.499 seconds, Fetched 1 row(s)
   [info]   
   [info]   [UNCLOSED_BRACKETED_COMMENT] Found an unclosed bracketed comment. 
Please, append */ at the end of the comment. SQLSTATE: 42601
   [info]   == SQL ==
   [info]   /* Here is a unclosed bracketed comment SELECT 1;
   [info]   spark-sql> /* Here is a unclosed bracketed comment SELECT 1;
   [info]   spark-sql> /* SELECT /*+ HINT() */ 4; */;
   [info]   spark-sql> 
   ```
   
   As you can see the fragment above, the query on the 3rd line from the 
bottom, came from STDOUT, was printed later than its error output, came from 
STDERR.
   
   In this scenario, the error output would not match anything and would simply 
go unnoticed. Finally, timed out and failed.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   no
   
   ### How was this patch tested?
   
   
   existing tests and CI
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   no


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45825][SQL][CORE][GRAPHX] Fix some scala compile warnings in module sql/catalyst [spark]

2023-11-08 Thread via GitHub



beliefer commented on PR #43717:
URL: https://github.com/apache/spark/pull/43717#issuecomment-1803114073

   > Are other modules to be completed in followup or will continue to update 
this one? @beliefer
   
   I created an umbrella issue, please see 
https://issues.apache.org/jira/browse/SPARK-45823


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45830][CORE] Refactor `StorageUtils#bufferCleaner` to avoid directly using classes under the `sun` package [spark]

2023-11-08 Thread via GitHub



LuciferYang commented on PR #43675:
URL: https://github.com/apache/spark/pull/43675#issuecomment-1803110236

   > Do you think there is a chance of performance difference, @LuciferYang ?
   
   `cleanMethod.invoke(cleanerMethod.invoke(buffer))` vs 
`cleanerMethod.invoke(unsafe, buffer)`?
   
   Although the official recommendation is for the former rather than 
reflection, I have previously done some mircabenchmarks on Java 8 and there is 
no significant performance difference between the two. If necessary, I can 
write some more cases to compare them in Java 17.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45825][SQL][CORE][GRAPHX] Fix some scala compile warnings in module sql/catalyst [spark]

2023-11-08 Thread via GitHub



beliefer commented on PR #43717:
URL: https://github.com/apache/spark/pull/43717#issuecomment-1803109724

   cc @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order [spark]

2023-11-08 Thread via GitHub



mridulm commented on PR #43627:
URL: https://github.com/apache/spark/pull/43627#issuecomment-1803110054

   > On the SPIP, I did send it to dev as a DISCUSS thread on November 4. I 
would like to get input on that as well.
   
   You are right, I do [see it 
here](https://lists.apache.org/thread/m5pv7vzg3109ts706b13t1c9p5zys2rq) - looks 
like I missed it :-(


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45825][SQL][CORE][GRAPHX] Fix some scala compile warnings in module sql/catalyst [spark]

2023-11-08 Thread via GitHub



beliefer commented on code in PR #43717:
URL: https://github.com/apache/spark/pull/43717#discussion_r1387445107


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala:
##
@@ -197,7 +197,7 @@ case class Mask(
*/
   override def eval(input: InternalRow): Any = {
 Mask.transformInput(
-  children(0).eval(input),
+  children.head.eval(input),

Review Comment:
   You means the old one is better?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] fix typo & remove some unused code [spark]

2023-11-08 Thread via GitHub



panbingkun opened a new pull request, #43724:
URL: https://github.com/apache/spark/pull/43724

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45831][CORE][SQL][DSTREAM] Use collection factory instead to create immutable Java collections [spark]

2023-11-08 Thread via GitHub



dongjoon-hyun commented on PR #43709:
URL: https://github.com/apache/spark/pull/43709#issuecomment-1803103862

   Merged to master~ 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45831][CORE][SQL][DSTREAM] Use collection factory instead to create immutable Java collections [spark]

2023-11-08 Thread via GitHub



dongjoon-hyun closed pull request #43709: [SPARK-45831][CORE][SQL][DSTREAM] Use 
collection factory instead to create immutable Java collections
URL: https://github.com/apache/spark/pull/43709


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45835][INFRA] Make gitHub labeler more accurate and remove outdated comments [spark]

2023-11-08 Thread via GitHub



dongjoon-hyun commented on PR #43716:
URL: https://github.com/apache/spark/pull/43716#issuecomment-1803103118

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45835][INFRA] Make gitHub labeler more accurate and remove outdated comments [spark]

2023-11-08 Thread via GitHub



dongjoon-hyun closed pull request #43716: [SPARK-45835][INFRA] Make gitHub 
labeler more accurate and remove outdated comments
URL: https://github.com/apache/spark/pull/43716


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45831][CORE][SQL][DSTREAM] Use collection factory instead to create immutable Java collections [spark]

2023-11-08 Thread via GitHub



LuciferYang commented on PR #43709:
URL: https://github.com/apache/spark/pull/43709#issuecomment-1803103014

   Thanks @dongjoon-hyun ~


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45825][SQL] Fix some scala compile warnings in module sql/catalyst [spark]

2023-11-08 Thread via GitHub



LuciferYang commented on PR #43717:
URL: https://github.com/apache/spark/pull/43717#issuecomment-1803101246

   Are other modules to be completed in followup or will continue to update 
this one? @beliefer 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45825][SQL] Fix some scala compile warnings in module sql/catalyst [spark]

2023-11-08 Thread via GitHub



dongjoon-hyun commented on code in PR #43717:
URL: https://github.com/apache/spark/pull/43717#discussion_r1387438376


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/numberFormatExpressions.scala:
##
@@ -242,7 +242,7 @@ object ToCharacterBuilder extends ExpressionBuilder {
   override def build(funcName: String, expressions: Seq[Expression]): 
Expression = {
 val numArgs = expressions.length
 if (numArgs == 2) {
-  val (inputExpr, format) = (expressions(0), expressions(1))
+  val (inputExpr, format) = (expressions.head, expressions(1))

Review Comment:
   I'm not sure about this part.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45825][SQL] Fix some scala compile warnings in module sql/catalyst [spark]

2023-11-08 Thread via GitHub



dongjoon-hyun commented on code in PR #43717:
URL: https://github.com/apache/spark/pull/43717#discussion_r1387438134


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala:
##
@@ -197,7 +197,7 @@ case class Mask(
*/
   override def eval(input: InternalRow): Any = {
 Mask.transformInput(
-  children(0).eval(input),
+  children.head.eval(input),

Review Comment:
   In this case, the old one is more consistent to the next lines.



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala:
##
@@ -237,7 +237,7 @@ case class Mask(
   ctx: CodegenContext,
   ev: ExprCode,
   f: (String, String, String, String, String) => String): ExprCode = {
-val firstGen = children(0).genCode(ctx)
+val firstGen = children.head.genCode(ctx)

Review Comment:
   ditto.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45832][SQL] Fix `Super method + is deprecated.` [spark]

2023-11-08 Thread via GitHub



LuciferYang commented on PR #43713:
URL: https://github.com/apache/spark/pull/43713#issuecomment-1803099217

   hmm...  if scalac doesn't report this warning, my personal suggestion is to 
leave it unfixed. The inheritance hierarchy is as follows:
   
   `AttributeMap` <- `immutable.Map` <- `immutable.MapOps` <- 
`collection.MapOps`
   
   Although the `+` of `collection.MapOps` is marked as deprecated, the `+` of 
`immutable.MapOps` does not appear to be deprecated.
   
   ```scala
 /**
   * Alias for `updated`
   *
   * @param kv the key/value pair.
   * @tparam V1 the type of the value in the key/value pair.
   * @return A new map with the new binding added to this map.
   */
 override def + [V1 >: V](kv: (K, V1)): CC[K, V1] = updated(kv._1, kv._2)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]

2023-11-08 Thread via GitHub



beliefer commented on code in PR #43711:
URL: https://github.com/apache/spark/pull/43711#discussion_r1387436911


##
sql/core/src/test/resources/sql-tests/results/group-by.sql.out:
##
@@ -338,7 +338,7 @@ SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*)
 -- !query schema
 struct
 -- !query output
-1.01.0 3
+0. 0.  3

Review Comment:
   You can reference the output of other mainstream databases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45825][SQL] Fix some scala compile warnings in module sql/catalyst [spark]

2023-11-08 Thread via GitHub



beliefer commented on PR #43717:
URL: https://github.com/apache/spark/pull/43717#issuecomment-1803097528

   ping @dongjoon-hyun @LuciferYang cc @panbingkun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]

2023-11-08 Thread via GitHub



liujiayi771 commented on code in PR #43711:
URL: https://github.com/apache/spark/pull/43711#discussion_r1387434023


##
sql/core/src/test/resources/sql-tests/results/group-by.sql.out:
##
@@ -338,7 +338,7 @@ SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*)
 -- !query schema
 struct
 -- !query output
-1.01.0 3
+0. 0.  3

Review Comment:
   The result is not incorrect, it is just a precision issue with double. For 
example,
   ```java
   2 / Math.sqrt(2 * 2) = 1.0
   2 / Math.sqrt(2) / Math.sqrt(2) = 0.
   ```
From the user's perspective, 1.0 is more user-friendly. 
   I am currently unsure about whether to sacrifice user-friendliness in order 
to support an extreme case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]

2023-11-08 Thread via GitHub



liujiayi771 commented on code in PR #43711:
URL: https://github.com/apache/spark/pull/43711#discussion_r1387434023


##
sql/core/src/test/resources/sql-tests/results/group-by.sql.out:
##
@@ -338,7 +338,7 @@ SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*)
 -- !query schema
 struct
 -- !query output
-1.01.0 3
+0. 0.  3

Review Comment:
   The result is not incorrect, it is just a precision issue with double. For 
example,
   ```java
   2 / Math.sqrt(2 * 2) = 1.0
   2 / Math.sqrt(2) * Math.sqrt(2) = 0.
   ```
From the user's perspective, 1.0 is more user-friendly. 
   I am currently unsure about whether to sacrifice user-friendliness in order 
to support an extreme case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45832][SQL] Fix `Super method + is deprecated.` [spark]

2023-11-08 Thread via GitHub



beliefer commented on PR #43713:
URL: https://github.com/apache/spark/pull/43713#issuecomment-1803089169

   Yes. this warning appeared in `IntelliJ`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45831][CORE][SQL][DSTREAM] Use collection factory instead to create immutable Java collections [spark]

2023-11-08 Thread via GitHub



LuciferYang commented on PR #43709:
URL: https://github.com/apache/spark/pull/43709#issuecomment-1803085434

   > There are also some places where `Collections.unmodifiable` is used, such 
as
   > 
   > 
https://github.com/apache/spark/blob/bfcd2c47742d69632584f6bb3cf08a8bc3ca8c8c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L85-L102
   > 
   > But I don't think the new API can make them look simpler, so I didn't 
change them in this pr
   
   
   
   @dongjoon-hyun There are three other similar places here, and I think both 
`Collections.unmodifiableMap` and `Map.copyOf` can be used, and `Map.copyOf` 
will trigger a collection copy, so no changes were made in this pr.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-42746][SQL] Add the LISTAGG() aggregate function [spark]

2023-11-08 Thread via GitHub



hopefulnick commented on code in PR #42398:
URL: https://github.com/apache/spark/pull/42398#discussion_r1387425067


##
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##
@@ -158,6 +158,69 @@ class SQLQuerySuite extends QueryTest with 
SharedSparkSession with AdaptiveSpark
 }
   }
 
+  test("SPARK-42746: listagg function") {
+withTempView("df", "df2") {
+  Seq(("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), (null, 
null)).toDF("a", "b")
+.createOrReplaceTempView("df")
+  checkAnswer(
+sql("select listagg(b) from df group by a"),
+Row("") :: Row("b,c") :: Row("c,d") :: Nil)
+
+  checkAnswer(
+sql("select listagg(b) from df where 1 != 1"),
+Row("") :: Nil)
+
+  checkAnswer(
+sql("select listagg(b, '|') from df group by a"),
+Row("b|c") :: Row("c|d") :: Row("") :: Nil)
+
+  checkAnswer(
+sql("SELECT LISTAGG(a) FROM df"),
+Row("a,a,b,b") :: Nil)
+
+  checkAnswer(
+sql("SELECT LISTAGG(DISTINCT a) FROM df"),
+Row("a,b") :: Nil)
+
+  checkAnswer(
+sql("SELECT LISTAGG(a) WITHIN GROUP (ORDER BY a) FROM df"),
+Row("a,a,b,b") :: Nil)
+
+  checkAnswer(
+sql("SELECT LISTAGG(a) WITHIN GROUP (ORDER BY a DESC) FROM df"),
+Row("b,b,a,a") :: Nil)
+
+  checkAnswer(
+sql("SELECT LISTAGG(a) WITHIN GROUP (ORDER BY a DESC) " +
+  "OVER (PARTITION BY b) FROM df"),
+Row("a") :: Row("b,a") :: Row("b,a") :: Row("b") :: Row("") :: Nil)
+
+  checkAnswer(
+sql("SELECT LISTAGG(a) WITHIN GROUP (ORDER BY b) FROM df"),
+Row("a,a,b,b") :: Nil)
+
+  checkAnswer(
+sql("SELECT LISTAGG(a) WITHIN GROUP (ORDER BY b DESC) FROM df"),

Review Comment:
   when specifying the custom seperator, like ',', it will get 
"b','a','b,','a", not the expected result "b,a,b,a"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]

2023-11-08 Thread via GitHub



beliefer commented on code in PR #43711:
URL: https://github.com/apache/spark/pull/43711#discussion_r1387423207


##
sql/core/src/test/resources/sql-tests/results/group-by.sql.out:
##
@@ -338,7 +338,7 @@ SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*)
 -- !query schema
 struct
 -- !query output
-1.01.0 3
+0. 0.  3

Review Comment:
   I guess `0.` is incorrect.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45562][SQL] XML: Add SQL error class for missing rowTag option [spark]

2023-11-08 Thread via GitHub



beliefer commented on code in PR #43710:
URL: https://github.com/apache/spark/pull/43710#discussion_r1387422030


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlOptions.scala:
##
@@ -66,7 +66,11 @@ private[sql] class XmlOptions(
 
   val compressionCodec = 
parameters.get(COMPRESSION).map(CompressionCodecs.getCodecClassName)
   val rowTagOpt = parameters.get(XmlOptions.ROW_TAG).map(_.trim)
-  require(!rowTagRequired || rowTagOpt.isDefined, s"'${XmlOptions.ROW_TAG}' 
option is required.")
+
+  if (rowTagRequired && rowTagOpt.isEmpty) {
+throw QueryCompilationErrors.xmlRowTagRequiredError(XmlOptions.ROW_TAG)

Review Comment:
   if so, we should change `val rowTag = 
rowTagOpt.getOrElse(XmlOptions.DEFAULT_ROW_TAG)` to `val rowTag = rowTagOpt.get`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45832][SQL] Fix `Super method + is deprecated.` [spark]

2023-11-08 Thread via GitHub



panbingkun commented on PR #43713:
URL: https://github.com/apache/spark/pull/43713#issuecomment-1803076916

   A strange and interesting phenomenon is that:
   when I add `"-Wconf:cat=deprecation=Super method:e"` to `SparkBuild`, 
then run `./build/sbt "catalyst/testOnly"`, no error was triggered, but I did 
see `deprecated prompt` in `IntelliJ`. 
   Is it a `false alarm`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45826][SQL] Add a SQL config for extra traces in `Origin` [spark]

2023-11-08 Thread via GitHub



beliefer commented on code in PR #43695:
URL: https://github.com/apache/spark/pull/43695#discussion_r1387412665


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -4531,6 +4531,14 @@ object SQLConf {
   .booleanConf
   .createWithDefault(false)
 
+  val EXTRA_ORIGIN_TRACES = buildConf("spark.sql.extraOriginTraces")
+.doc("The number of additional non-Spark SQL traces in the captured 
DataFrame context. " +
+  "When it is set to 0, captured one Spark traces and a followed non-Spark 
trace.")
+.version("4.0.0")
+.intConf
+.checkValue(_ >= 0, "The number of extra thread traces must be 
non-negative.")

Review Comment:
   I'm a bit confused.
   Intuitively, I feel that 0 should represent the absence of non-Spark trace.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45831][CORE][SQL][DSTREAM] Use collection factory instead to create immutable Java collections [spark]

2023-11-08 Thread via GitHub



dongjoon-hyun commented on code in PR #43709:
URL: https://github.com/apache/spark/pull/43709#discussion_r1387410114


##
common/utils/src/main/java/org/apache/spark/network/util/JavaUtils.java:
##
@@ -202,29 +202,27 @@ private static boolean isSymlink(File file) throws 
IOException {
   private static final Map byteSuffixes;
 
   static {
-final Map timeSuffixesBuilder = new HashMap<>();
-timeSuffixesBuilder.put("us", TimeUnit.MICROSECONDS);
-timeSuffixesBuilder.put("ms", TimeUnit.MILLISECONDS);
-timeSuffixesBuilder.put("s", TimeUnit.SECONDS);
-timeSuffixesBuilder.put("m", TimeUnit.MINUTES);
-timeSuffixesBuilder.put("min", TimeUnit.MINUTES);
-timeSuffixesBuilder.put("h", TimeUnit.HOURS);
-timeSuffixesBuilder.put("d", TimeUnit.DAYS);
-timeSuffixes = Collections.unmodifiableMap(timeSuffixesBuilder);
-
-final Map byteSuffixesBuilder = new HashMap<>();
-byteSuffixesBuilder.put("b", ByteUnit.BYTE);
-byteSuffixesBuilder.put("k", ByteUnit.KiB);
-byteSuffixesBuilder.put("kb", ByteUnit.KiB);
-byteSuffixesBuilder.put("m", ByteUnit.MiB);
-byteSuffixesBuilder.put("mb", ByteUnit.MiB);
-byteSuffixesBuilder.put("g", ByteUnit.GiB);
-byteSuffixesBuilder.put("gb", ByteUnit.GiB);
-byteSuffixesBuilder.put("t", ByteUnit.TiB);
-byteSuffixesBuilder.put("tb", ByteUnit.TiB);
-byteSuffixesBuilder.put("p", ByteUnit.PiB);
-byteSuffixesBuilder.put("pb", ByteUnit.PiB);
-byteSuffixes = Collections.unmodifiableMap(byteSuffixesBuilder);
+  timeSuffixes = Map.of(

Review Comment:
   Indentation?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45842][SQL] Refactor Catalog Function APIs to use analyzer [spark]

2023-11-08 Thread via GitHub



dongjoon-hyun closed pull request #43720: [SPARK-45842][SQL] Refactor Catalog 
Function APIs to use analyzer
URL: https://github.com/apache/spark/pull/43720


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45686][INFRA][CORE][SQL][SS][CONNECT][MLLIB][DSTREAM][AVRO][ML][K8S][YARN][PYTHON][R][UI][GRAPHX][PROTOBUF][TESTS][EXAMPLES] Explicitly convert `Array` to `Seq` when function input is

2023-11-08 Thread via GitHub



LuciferYang commented on PR #43670:
URL: https://github.com/apache/spark/pull/43670#issuecomment-1803060096

   rebased


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45832][SQL] Fix `Super method + is deprecated.` [spark]

2023-11-08 Thread via GitHub



beliefer commented on code in PR #43713:
URL: https://github.com/apache/spark/pull/43713#discussion_r1387405692


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLateralColumnAliasReference.scala:
##
@@ -170,7 +170,8 @@ object ResolveLateralColumnAliasReference extends 
Rule[LogicalPlan] {
   case (a: Alias, idx) =>
 val lcaResolved = unwrapLCAReference(a)
 // Insert the original alias instead of rewritten one to detect 
chained LCA
-aliasMap += (a.toAttribute -> AliasEntry(a, idx))
+aliasMap =
+  AttributeMap(aliasMap.baseMap.values.toMap + (a.toAttribute -> 
AliasEntry(a, idx)))

Review Comment:
   Sounds good. But it seems created extra AttributeMap.
   How about `aliasMap = AttributeMap(aliasMap.updated(a.toAttribute, 
AliasEntry(a, idx)))`?



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLateralColumnAliasReference.scala:
##
@@ -170,7 +170,8 @@ object ResolveLateralColumnAliasReference extends 
Rule[LogicalPlan] {
   case (a: Alias, idx) =>
 val lcaResolved = unwrapLCAReference(a)
 // Insert the original alias instead of rewritten one to detect 
chained LCA
-aliasMap += (a.toAttribute -> AliasEntry(a, idx))
+aliasMap =
+  AttributeMap(aliasMap.baseMap.values.toMap + (a.toAttribute -> 
AliasEntry(a, idx)))

Review Comment:
   Sounds good. But it seems created extra `AttributeMap`.
   How about `aliasMap = AttributeMap(aliasMap.updated(a.toAttribute, 
AliasEntry(a, idx)))`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45686][INFRA][CORE][SQL][SS][CONNECT][MLLIB][DSTREAM][AVRO][ML][K8S][YARN][PYTHON][R][UI][GRAPHX][PROTOBUF][TESTS][EXAMPLES] Explicitly convert `Array` to `Seq` when function input is

2023-11-08 Thread via GitHub



LuciferYang commented on PR #43670:
URL: https://github.com/apache/spark/pull/43670#issuecomment-1803057104

   > So this should help performance in 2.13 by avoiding defensive copies, and 
instead wrapping as 2.12 did?
   
   Yes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45686][INFRA][CORE][SQL][SS][CONNECT][MLLIB][DSTREAM][AVRO][ML][K8S][YARN][PYTHON][R][UI][GRAPHX][PROTOBUF][TESTS][EXAMPLES] Explicitly convert `Array` to `Seq` when function input is

2023-11-08 Thread via GitHub



LuciferYang commented on PR #43670:
URL: https://github.com/apache/spark/pull/43670#issuecomment-1803056696

   > UserDefinedPythonDataSource
   
   Yes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]

2023-11-08 Thread via GitHub



liujiayi771 commented on PR #43711:
URL: https://github.com/apache/spark/pull/43711#issuecomment-1803054998

   The potential side effect of this modification is that it is easier to 
obtain a finite number for `sqrt(xMk * yMk)`, while `sqrt(xMk) * sqrt(yMk)` can 
easily result in an infinite number, for example,
   ```java
   Math.sqrt(2 * 2) = 2.0
   Math.sqrt(2) * Math.sqrt(2) = 2.0004
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]

2023-11-08 Thread via GitHub



beliefer commented on code in PR #43629:
URL: https://github.com/apache/spark/pull/43629#discussion_r1387395341


##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala:
##
@@ -86,19 +91,31 @@ object CommandUtils extends Logging {
   // Calculate table size as a sum of the visible partitions. See 
SPARK-21079
   val partitions = 
sessionState.catalog.listPartitions(catalogTable.identifier)
   logInfo(s"Starting to calculate sizes for ${partitions.length} 
partitions.")
-  val paths = partitions.map(_.storage.locationUri)
-  val sizes = calculateMultipleLocationSizes(spark, 
catalogTable.identifier, paths)
-  val newPartitions = partitions.zipWithIndex.flatMap { case (p, idx) =>
-val newStats = CommandUtils.compareAndGetNewStats(p.stats, sizes(idx), 
None)
-newStats.map(_ => p.copy(stats = newStats))
-  }
+  val (sizes, newPartitions) = calculatePartitionStats(spark, 
catalogTable, partitions,
+partitionRowCount)
   (sizes.sum, newPartitions)
 }
 logInfo(s"It took ${(System.nanoTime() - startTime) / (1000 * 1000)} ms to 
calculate" +
   s" the total size for table ${catalogTable.identifier}.")
 (totalSize, newPartitions)
   }
 
+  def calculatePartitionStats(
+  spark: SparkSession,
+  catalogTable: CatalogTable,
+  partitions: Seq[CatalogTablePartition],
+  partitionRowCount: Option[Map[TablePartitionSpec, BigInt]] = None):
+  (Seq[Long], Seq[CatalogTablePartition]) = {
+val paths = partitions.map(_.storage.locationUri)
+val sizes = calculateMultipleLocationSizes(spark, catalogTable.identifier, 
paths)
+val newPartitions = partitions.zipWithIndex.flatMap { case (p, idx) =>
+  val newRowCount = partitionRowCount.flatMap(_.get(p.spec))
+  val newStats = CommandUtils.compareAndGetNewStats(p.stats, sizes(idx), 
newRowCount)
+  newStats.map(_ => p.copy(stats = newStats))
+}
+(sizes, newPartitions)

Review Comment:
   ```suggestion
   (sizes.sum, newPartitions)
   ```



##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala:
##
@@ -101,56 +98,19 @@ case class AnalyzePartitionCommand(
   if (noscan) {
 Map.empty
   } else {
-calculateRowCountsPerPartition(sparkSession, tableMeta, 
partitionValueSpec)
+CommandUtils.calculateRowCountsPerPartition(sparkSession, tableMeta, 
partitionValueSpec)
   }
 
 // Update the metastore if newly computed statistics are different from 
those
 // recorded in the metastore.
-
-val sizes = CommandUtils.calculateMultipleLocationSizes(sparkSession, 
tableMeta.identifier,
-  partitions.map(_.storage.locationUri))
-val newPartitions = partitions.zipWithIndex.flatMap { case (p, idx) =>
-  val newRowCount = rowCounts.get(p.spec)
-  val newStats = CommandUtils.compareAndGetNewStats(p.stats, sizes(idx), 
newRowCount)
-  newStats.map(_ => p.copy(stats = newStats))
-}
-
+val (_, newPartitions) = CommandUtils.calculatePartitionStats(
+  sparkSession, tableMeta, partitions, Some(rowCounts))
 if (newPartitions.nonEmpty) {
   sessionState.catalog.alterPartitions(tableMeta.identifier, newPartitions)
 }
 
 Seq.empty[Row]
   }
 
-  private def calculateRowCountsPerPartition(

Review Comment:
   Is `calculateRowCountsPerPartition` shared with other caller?



##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala:
##
@@ -86,19 +91,31 @@ object CommandUtils extends Logging {
   // Calculate table size as a sum of the visible partitions. See 
SPARK-21079
   val partitions = 
sessionState.catalog.listPartitions(catalogTable.identifier)
   logInfo(s"Starting to calculate sizes for ${partitions.length} 
partitions.")
-  val paths = partitions.map(_.storage.locationUri)
-  val sizes = calculateMultipleLocationSizes(spark, 
catalogTable.identifier, paths)
-  val newPartitions = partitions.zipWithIndex.flatMap { case (p, idx) =>
-val newStats = CommandUtils.compareAndGetNewStats(p.stats, sizes(idx), 
None)
-newStats.map(_ => p.copy(stats = newStats))
-  }
+  val (sizes, newPartitions) = calculatePartitionStats(spark, 
catalogTable, partitions,
+partitionRowCount)
   (sizes.sum, newPartitions)

Review Comment:
   We can delete this line if `calculatePartitionStats` returns `sizes.sum`.



##
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala:
##
@@ -363,6 +363,85 @@ class StatisticsSuite extends StatisticsCollectionTestBase 
with TestHiveSingleto
 }
   }
 
+  test("SPARK-45731: update partition stats with ANALYZE TABLE") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String): Option[CatalogStatistics] = {
+  val partition =
+spark.sessionState.catalog.getPartition(TableIdentifier(tableName), 
Map("ds" ->

Re: [PR] [SPARK-45832][SQL] Fix `Super method + is deprecated.` [spark]

2023-11-08 Thread via GitHub



panbingkun commented on code in PR #43713:
URL: https://github.com/apache/spark/pull/43713#discussion_r1387392114


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLateralColumnAliasReference.scala:
##
@@ -170,7 +170,8 @@ object ResolveLateralColumnAliasReference extends 
Rule[LogicalPlan] {
   case (a: Alias, idx) =>
 val lcaResolved = unwrapLCAReference(a)
 // Insert the original alias instead of rewritten one to detect 
chained LCA
-aliasMap += (a.toAttribute -> AliasEntry(a, idx))
+aliasMap =
+  AttributeMap(aliasMap.baseMap.values.toMap + (a.toAttribute -> 
AliasEntry(a, idx)))

Review Comment:
   Maybe so that we can hide access to internal values as much as possible.



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLateralColumnAliasReference.scala:
##
@@ -170,7 +170,8 @@ object ResolveLateralColumnAliasReference extends 
Rule[LogicalPlan] {
   case (a: Alias, idx) =>
 val lcaResolved = unwrapLCAReference(a)
 // Insert the original alias instead of rewritten one to detect 
chained LCA
-aliasMap += (a.toAttribute -> AliasEntry(a, idx))
+aliasMap =
+  AttributeMap(aliasMap.baseMap.values.toMap + (a.toAttribute -> 
AliasEntry(a, idx)))

Review Comment:
   Maybe so that we can hide access to `internal values` as much as possible.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45832][SQL] Fix `Super method + is deprecated.` [spark]

2023-11-08 Thread via GitHub



panbingkun commented on code in PR #43713:
URL: https://github.com/apache/spark/pull/43713#discussion_r1387391412


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLateralColumnAliasReference.scala:
##
@@ -170,7 +170,8 @@ object ResolveLateralColumnAliasReference extends 
Rule[LogicalPlan] {
   case (a: Alias, idx) =>
 val lcaResolved = unwrapLCAReference(a)
 // Insert the original alias instead of rewritten one to detect 
chained LCA
-aliasMap += (a.toAttribute -> AliasEntry(a, idx))
+aliasMap =
+  AttributeMap(aliasMap.baseMap.values.toMap + (a.toAttribute -> 
AliasEntry(a, idx)))

Review Comment:
   ```suggestion
 aliasMap ++= AttributeMap(Map(a.toAttribute -> AliasEntry(a, 
idx)))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] Add support for java 17 from spark 3.5.0 [spark-docker]

2023-11-08 Thread via GitHub



Yikun commented on PR #56:
URL: https://github.com/apache/spark-docker/pull/56#issuecomment-1803034637

   Thanks for your efforts @vakarisbk , I'm going to merge this PR later today 
or tomorrow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order [spark]

2023-11-08 Thread via GitHub



abellina commented on PR #43627:
URL: https://github.com/apache/spark/pull/43627#issuecomment-1803024221

   @mridulm thanks for the comments. I think they mostly work and I’m testing 
it at the moment. I’ll push something either tonight or tomorrow am.
   
   On the SPIP, I did send it to dev as a DISCUSS thread on November 4. I would 
like to get input on that as well.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45829][DOCS] Update the default value for spark.executor.logs.rolling.maxSize [spark]

2023-11-08 Thread via GitHub



chenyu-opensource commented on PR #43712:
URL: https://github.com/apache/spark/pull/43712#issuecomment-1803002444

   > Thanks. Merged to master 3.5 3.4 3.3
   
   Thank you so much for your review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45814][CONNECT][SQL]Make ArrowConverters.createEmptyArrowBatch call close() to avoid memory leak [spark]

2023-11-08 Thread via GitHub



xieshuaihu commented on PR #43691:
URL: https://github.com/apache/spark/pull/43691#issuecomment-1803001325

   It seems like that this test failure is unrelated to this pr


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]

2023-11-08 Thread via GitHub



sunchao commented on code in PR #43629:
URL: https://github.com/apache/spark/pull/43629#discussion_r1387358004


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -2671,6 +2671,16 @@ object SQLConf {
   .booleanConf
   .createWithDefault(false)
 
+  val ANALYZE_PARTITION_STATS_ENABLED =
+buildConf("spark.sql.statistics.update.partitionStats.enabled")

Review Comment:
   Thanks, updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]

2023-11-08 Thread via GitHub



sunchao commented on code in PR #43629:
URL: https://github.com/apache/spark/pull/43629#discussion_r1387343974


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -2671,6 +2671,16 @@ object SQLConf {
   .booleanConf
   .createWithDefault(false)
 
+  val ANALYZE_PARTITION_STATS_ENABLED =
+buildConf("spark.sql.statistics.update.partitionStats.enabled")

Review Comment:
   Hmm, how about 
`spark.sql.statistics.updatePartitionStatsInAnalyzeTable.enabled`? it is a bit 
long but can fully express the intention  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]

2023-11-08 Thread via GitHub



dongjoon-hyun commented on code in PR #43629:
URL: https://github.com/apache/spark/pull/43629#discussion_r1387354165


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -2671,6 +2671,16 @@ object SQLConf {
   .booleanConf
   .createWithDefault(false)
 
+  val ANALYZE_PARTITION_STATS_ENABLED =
+buildConf("spark.sql.statistics.update.partitionStats.enabled")

Review Comment:
   +1 for the new name.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]

2023-11-08 Thread via GitHub



sunchao commented on code in PR #43629:
URL: https://github.com/apache/spark/pull/43629#discussion_r1387343974


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -2671,6 +2671,16 @@ object SQLConf {
   .booleanConf
   .createWithDefault(false)
 
+  val ANALYZE_PARTITION_STATS_ENABLED =
+buildConf("spark.sql.statistics.update.partitionStats.enabled")

Review Comment:
   Hmm, how about 
`spark.sql.statistics.updatePartitionStatsInAnalyzeTable.enabled`? it is a bit 
long but can fully expression the intention  



##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -2671,6 +2671,16 @@ object SQLConf {
   .booleanConf
   .createWithDefault(false)
 
+  val ANALYZE_PARTITION_STATS_ENABLED =
+buildConf("spark.sql.statistics.update.partitionStats.enabled")

Review Comment:
   Hmm, how about 
`spark.sql.statistics.updatePartitionStatsInAnalyzeTable.enabled`? it is a bit 
long but can fully expression the intention  



##
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala:
##
@@ -372,54 +372,70 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   partition.stats
 }
 
-withTable(tableName) {
-  withTempPath { path =>
-// Create a table with 3 partitions all located under a single 
top-level directory 'path'
-sql(
-  s"""
- |CREATE TABLE $tableName (key STRING, value STRING)
- |USING hive
- |PARTITIONED BY (ds STRING)
- |LOCATION '${path.toURI}'
- """.stripMargin)
+Seq(true, false).foreach { partitionStatsEnabled =>
+  withSQLConf(SQLConf.ANALYZE_PARTITION_STATS_ENABLED.key -> 
partitionStatsEnabled.toString) {
+withTable(tableName) {
+  withTempPath { path =>
+// Create a table with 3 partitions all located under a directory 
'path'
+sql(
+  s"""
+ |CREATE TABLE $tableName (key STRING, value STRING)
+ |USING hive
+ |PARTITIONED BY (ds STRING)
+ |LOCATION '${path.toURI}'
+  """.stripMargin)
 
-val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03")
+val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03")
 
-partitionDates.foreach { ds =>
-  sql(s"ALTER TABLE $tableName ADD PARTITION (ds='$ds') LOCATION 
'$path/ds=$ds'")
-  sql("SELECT * FROM src").write.mode(SaveMode.Overwrite)
-  .format("parquet").save(s"$path/ds=$ds")
-}
+partitionDates.foreach { ds =>
+  sql(s"ALTER TABLE $tableName ADD PARTITION (ds='$ds') LOCATION 
'$path/ds=$ds'")
+  sql("SELECT * from src").write.mode(SaveMode.Overwrite)

Review Comment:
   oops will revert this change



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]

2023-11-08 Thread via GitHub



dongjoon-hyun commented on code in PR #43629:
URL: https://github.com/apache/spark/pull/43629#discussion_r1387340831


##
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala:
##
@@ -372,54 +372,70 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   partition.stats
 }
 
-withTable(tableName) {
-  withTempPath { path =>
-// Create a table with 3 partitions all located under a single 
top-level directory 'path'
-sql(
-  s"""
- |CREATE TABLE $tableName (key STRING, value STRING)
- |USING hive
- |PARTITIONED BY (ds STRING)
- |LOCATION '${path.toURI}'
- """.stripMargin)
+Seq(true, false).foreach { partitionStatsEnabled =>
+  withSQLConf(SQLConf.ANALYZE_PARTITION_STATS_ENABLED.key -> 
partitionStatsEnabled.toString) {
+withTable(tableName) {
+  withTempPath { path =>
+// Create a table with 3 partitions all located under a directory 
'path'
+sql(
+  s"""
+ |CREATE TABLE $tableName (key STRING, value STRING)
+ |USING hive
+ |PARTITIONED BY (ds STRING)
+ |LOCATION '${path.toURI}'
+  """.stripMargin)
 
-val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03")
+val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03")
 
-partitionDates.foreach { ds =>
-  sql(s"ALTER TABLE $tableName ADD PARTITION (ds='$ds') LOCATION 
'$path/ds=$ds'")
-  sql("SELECT * FROM src").write.mode(SaveMode.Overwrite)
-  .format("parquet").save(s"$path/ds=$ds")
-}
+partitionDates.foreach { ds =>
+  sql(s"ALTER TABLE $tableName ADD PARTITION (ds='$ds') LOCATION 
'$path/ds=$ds'")
+  sql("SELECT * from src").write.mode(SaveMode.Overwrite)

Review Comment:
   Let's keep the original. It was `FROM` instead of `from`..



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]

2023-11-08 Thread via GitHub



dongjoon-hyun commented on code in PR #43629:
URL: https://github.com/apache/spark/pull/43629#discussion_r1387340391


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -2671,6 +2671,16 @@ object SQLConf {
   .booleanConf
   .createWithDefault(false)
 
+  val ANALYZE_PARTITION_STATS_ENABLED =
+buildConf("spark.sql.statistics.update.partitionStats.enabled")

Review Comment:
   1. Do we have other configurations under `spark.sql.statistics.update.*`? We 
should not add a new namespace for a single configuration. Shall we rename like 
the following?
   ```
   - spark.sql.statistics.update.partitionStats.enabled
   + spark.sql.statistics.updatePartitionStats.enabled
   ```
   
   2. Also, the current config name looks too general. Can we revise the config 
name to give some idea about `ANALYZE TABLE` syntax?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]

2023-11-08 Thread via GitHub



sunchao commented on PR #43629:
URL: https://github.com/apache/spark/pull/43629#issuecomment-1802927767

   Thanks all! Updated the PR with a new flag.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45845][SS][UI] Add number of evicted state rows to streaming UI [spark]

2023-11-08 Thread via GitHub



WweiL commented on PR #43723:
URL: https://github.com/apache/spark/pull/43723#issuecomment-1802783939

   cc @HeartSaVioR  can you take a look please : )


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-45845][SS][UI] Add number of evicted state rows to streaming UI [spark]

2023-11-08 Thread via GitHub



WweiL opened a new pull request, #43723:
URL: https://github.com/apache/spark/pull/43723

   
   
   ### What changes were proposed in this pull request?
   
   In a stateful query, a watermark has two responsibilities:
   1. drop late rows
   2. Evict state rows from state store
   
   Before we only log "aggregated number of rows dropped by watermark". This is 
case 1. But people would confuse this with case 2. This PR purpose we also add 
a chart for case 2. Also made the explanation for case 1 more verbose.
   
   Now:
   https://github.com/apache/spark/assets/10248890/b1f7f94f-94fe-490b-b2d9-abbb7170b67b;>
   
   
   ### Why are the changes needed?
   
   UI improvement
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, users will be able to view the new UI
   
   ### How was this patch tested?
   
   Manually tested
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45752][SQL] Unreferenced CTE should all be checked by CheckAnalysis0 [spark]

2023-11-08 Thread via GitHub



amaliujia commented on code in PR #43614:
URL: https://github.com/apache/spark/pull/43614#discussion_r1387214224


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##
@@ -156,7 +171,15 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
   // If a CTE relation is never used, it will disappear after inline. Here 
we explicitly check
   // analysis for it, to make sure the entire query plan is valid.
   try {
-if (refCount == 0) checkAnalysis0(relation.child)

Review Comment:
   Ok I added some comments to explain we are trying to check the leaf relation 
first because that gives the most accurate error message.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45827] Add Variant data type in Spark. [spark]

2023-11-08 Thread via GitHub



chenhao-db commented on code in PR #43707:
URL: https://github.com/apache/spark/pull/43707#discussion_r1387206814


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/PhysicalDataType.scala:
##
@@ -326,6 +327,17 @@ case class PhysicalStructType(fields: Array[StructField]) 
extends PhysicalDataTy
   }
 }
 
+class PhysicalVariantType extends PhysicalDataType {

Review Comment:
   I don't feel there is anything special and worth a comment in this class. 
All `Physical*Type` classes in this file don't have a comment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45843][CORE] Support `killall` in REST Submission API [spark]

2023-11-08 Thread via GitHub



dongjoon-hyun closed pull request #43721: [SPARK-45843][CORE] Support `killall` 
in REST Submission API
URL: https://github.com/apache/spark/pull/43721


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

1 2 3 >

1 - 100 of 206 matches

Mail list logo