Re: [PR] [SPARK-45814][CONNECT][SQL]Make ArrowConverters.createEmptyArrowBatch call close() to avoid memory leak [spark]
LuciferYang commented on PR #43691: URL: https://github.com/apache/spark/pull/43691#issuecomment-1803327334 Merged into master and branch-3.5. Thanks @xieshuaihu @hvanhovell @dongjoon-hyun @cfmcgrady This patch conflicts with branch-3.4. Could you submit a separate pr for branch-3.4? @xieshuaihu -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45814][CONNECT][SQL]Make ArrowConverters.createEmptyArrowBatch call close() to avoid memory leak [spark]
LuciferYang closed pull request #43691: [SPARK-45814][CONNECT][SQL]Make ArrowConverters.createEmptyArrowBatch call close() to avoid memory leak URL: https://github.com/apache/spark/pull/43691 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45562][SQL] XML: Add SQL error class for missing rowTag option [spark]
MaxGekk commented on code in PR #43710: URL: https://github.com/apache/spark/pull/43710#discussion_r1387600755 ## common/utils/src/main/resources/error/error-classes.json: ## @@ -3899,6 +3899,12 @@ }, "sqlState" : "42605" }, + "XML_ROW_TAG_OPTION_REQUIRED" : { Review Comment: Yep, I agree. The error class name should point out a problem, possible solutions should be in `message`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45814][CONNECT][SQL]Make ArrowConverters.createEmptyArrowBatch call close() to avoid memory leak [spark]
xieshuaihu commented on PR #43691: URL: https://github.com/apache/spark/pull/43691#issuecomment-1803309496 @LuciferYang all ut passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45826][SQL] Add a SQL config for extra traces in `Origin` [spark]
MaxGekk commented on code in PR #43695: URL: https://github.com/apache/spark/pull/43695#discussion_r1387598312 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -4531,6 +4531,14 @@ object SQLConf { .booleanConf .createWithDefault(false) + val EXTRA_ORIGIN_TRACES = buildConf("spark.sql.extraOriginTraces") +.doc("The number of additional non-Spark SQL traces in the captured DataFrame context. " + + "When it is set to 0, captured one Spark traces and a followed non-Spark trace.") +.version("4.0.0") +.intConf +.checkValue(_ >= 0, "The number of extra thread traces must be non-negative.") Review Comment: > Intuitively, I feel that 0 should represent the absence of non-Spark trace. Actually, it works in this way. Let me modify the config and PR description. The `slice` method excludes the `until` index. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45770][SQL][PYTHON][CONNECT] Introduce logical plan `UnresolvedDropColumns` for `Dataframe.drop` [spark]
cloud-fan commented on code in PR #43683: URL: https://github.com/apache/spark/pull/43683#discussion_r1387588910 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDataFrameDropColumns.scala: ## @@ -0,0 +1,46 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.catalyst.plans.logical.{DataFrameDropColumns, LogicalPlan, Project} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.trees.TreePattern.DF_DROP_COLUMNS +import org.apache.spark.sql.connector.catalog.CatalogManager + +/** + * A rule that rewrites DropColumns to Project. + */ +class ResolveDataFrameDropColumns(val catalogManager: CatalogManager) + extends Rule[LogicalPlan] with ColumnResolutionHelper { + + override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning( +_.containsPattern(DF_DROP_COLUMNS)) { +case d: DataFrameDropColumns if d.childrenResolved => + val dropped = d.dropList.map { +case u: UnresolvedAttribute => Review Comment: is this for `df1.drop(df2.col)`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45752][SQL] Unreferenced CTE should all be checked by CheckAnalysis0 [spark]
cloud-fan closed pull request #43614: [SPARK-45752][SQL] Unreferenced CTE should all be checked by CheckAnalysis0 URL: https://github.com/apache/spark/pull/43614 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45752][SQL] Unreferenced CTE should all be checked by CheckAnalysis0 [spark]
cloud-fan commented on PR #43614: URL: https://github.com/apache/spark/pull/43614#issuecomment-1803285997 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45827] Add Variant data type in Spark. [spark]
cloud-fan commented on code in PR #43707: URL: https://github.com/apache/spark/pull/43707#discussion_r1387580155 ## common/unsafe/src/main/java/org/apache/spark/unsafe/types/VariantVal.java: ## @@ -0,0 +1,124 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.unsafe.types; + +import org.apache.spark.unsafe.Platform; + +import java.io.Serializable; +import java.util.Arrays; + +/** + * The physical data representation of {@link org.apache.spark.sql.types.VariantType} that + * represents a semi-structured value. It consists of two binary values: {@link VariantVal#value} + * and {@link VariantVal#metadata}. The value encodes types and values, but not field names. The + * metadata currently contains a version flag and a list of field names. We can extend/modify the + * detailed binary format given the version flag. + * + * A {@link VariantVal} can be produced by casting another value into the Variant type or parsing a + * JSON string in the {@link org.apache.spark.sql.catalyst.expressions.variant.ParseJson} + * expression. We can extract a path consisting of field names and array indices from it, cast it + * into a concrete data type, or rebuild a JSON string from it. + * + * The storage layout of this class in {@link org.apache.spark.sql.catalyst.expressions.UnsafeRow} + * and {@link org.apache.spark.sql.catalyst.expressions.UnsafeArrayData} is: the fixed-size part is + * a long value "offsetAndSize". The upper 32 bits is the offset that points to the start position + * of the actual binary content. The lower 32 bits is the total length of the binary content. The + * binary content contains: 4 bytes representing the length of {@link VariantVal#value}, content of + * {@link VariantVal#value}, content of {@link VariantVal#metadata}. This is an internal and + * transient format and can be modified at any time. + */ +public class VariantVal implements Serializable { + protected final byte[] value; + protected final byte[] metadata; + + public VariantVal(byte[] value, byte[] metadata) { +this.value = value; +this.metadata = metadata; + } + + public byte[] getValue() { +return value; + } + + public byte[] getMetadata() { +return metadata; + } + + /** + * This function writes the binary content into {@code buffer} starting from {@code cursor}, as + * described in the class comment. The caller should guarantee there is enough space in `buffer`. + */ + public void writeIntoUnsafeRow(byte[] buffer, long cursor) { Review Comment: interesting, `UnsafeWriter` always use `byte[]`, then it's fine. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45831][CORE][SQL][DSTREAM] Use collection factory instead to create immutable Java collections [spark]
LuciferYang commented on PR #43709: URL: https://github.com/apache/spark/pull/43709#issuecomment-1803284125 Thanks @beliefer ~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45827] Add Variant data type in Spark. [spark]
cloud-fan commented on PR #43707: URL: https://github.com/apache/spark/pull/43707#issuecomment-1803280319 We should also check all the call sites of `DataSource#disallowWritingIntervals` and also disallow writing variants to DS v1. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45827] Add Variant data type in Spark. [spark]
cloud-fan commented on PR #43707: URL: https://github.com/apache/spark/pull/43707#issuecomment-1803277101 Can we check `FileFormat#supportDataType` and make sure only Parquet supports it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45562][SQL] XML: Add SQL error class for missing rowTag option [spark]
beliefer commented on code in PR #43710: URL: https://github.com/apache/spark/pull/43710#discussion_r1387573558 ## common/utils/src/main/resources/error/error-classes.json: ## @@ -3899,6 +3899,12 @@ }, "sqlState" : "42605" }, + "XML_ROW_TAG_OPTION_REQUIRED" : { Review Comment: How about `XML_ROW_TAG_MISSING`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45830][CORE] Refactor `StorageUtils#bufferCleaner` to avoid directly using classes under the `sun` package [spark]
LuciferYang commented on code in PR #43675: URL: https://github.com/apache/spark/pull/43675#discussion_r1387574129 ## core/src/main/scala/org/apache/spark/storage/StorageUtils.scala: ## @@ -197,13 +195,18 @@ private[spark] class StorageStatus( /** Helper methods for storage-related objects. */ private[spark] object StorageUtils extends Logging { - private val bufferCleaner: DirectBuffer => Unit = { -val cleanerMethod = - Utils.classForName("sun.misc.Unsafe").getMethod("invokeCleaner", classOf[ByteBuffer]) -val unsafeField = classOf[Unsafe].getDeclaredField("theUnsafe") -unsafeField.setAccessible(true) -val unsafe = unsafeField.get(null).asInstanceOf[Unsafe] -buffer: DirectBuffer => cleanerMethod.invoke(unsafe, buffer) + private val bufferCleaner: ByteBuffer => Unit = { +val cleanerClass = Utils.classForName("jdk.internal.ref.Cleaner") +val directBufferClass = Utils.classForName("sun.nio.ch.DirectBuffer") +val byteBufferLookup: MethodHandles.Lookup = + MethodHandles.privateLookupIn(directBufferClass, MethodHandles.lookup()) +val cleanerMethod: MethodHandle = byteBufferLookup + .findVirtual(directBufferClass, "cleaner", MethodType.methodType(cleanerClass)) +val cleanerLookup: MethodHandles.Lookup = + MethodHandles.privateLookupIn(cleanerClass, MethodHandles.lookup()) +val cleanMethod: MethodHandle = + cleanerLookup.findVirtual(cleanerClass, "clean", MethodType.methodType(classOf[Unit])) +buffer: ByteBuffer => cleanMethod.invoke(cleanerMethod.invoke(buffer)) Review Comment: The new method is indeed slower than the old method because there are two calls to the method handle. Let me think about what to do. @dongjoon-hyun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45562][SQL] XML: Add SQL error class for missing rowTag option [spark]
beliefer commented on code in PR #43710: URL: https://github.com/apache/spark/pull/43710#discussion_r1387571783 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlOptions.scala: ## @@ -66,7 +66,11 @@ private[sql] class XmlOptions( val compressionCodec = parameters.get(COMPRESSION).map(CompressionCodecs.getCodecClassName) val rowTagOpt = parameters.get(XmlOptions.ROW_TAG).map(_.trim) - require(!rowTagRequired || rowTagOpt.isDefined, s"'${XmlOptions.ROW_TAG}' option is required.") + + if (rowTagRequired && rowTagOpt.isEmpty) { +throw QueryCompilationErrors.xmlRowTagRequiredError(XmlOptions.ROW_TAG) Review Comment: I got it now. I'm missing the `rowTagRequired`. The `rowTagRequired` is false if the call the `to_xml`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45827] Add Variant data type in Spark. [spark]
cloud-fan commented on code in PR #43707: URL: https://github.com/apache/spark/pull/43707#discussion_r1387567758 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala: ## @@ -0,0 +1,54 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions.variant + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types._ + +// scalastyle:off line.size.limit +@ExpressionDescription( + usage = "_FUNC_(jsonStr) - Parse a JSON string as an Variant value. Throw an exception when the string is not valid JSON value.", + examples = +""" +Examples: + """, + since = "3.4.0", Review Comment: `4.0.0` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45827] Add Variant data type in Spark. [spark]
chenhao-db commented on code in PR #43707: URL: https://github.com/apache/spark/pull/43707#discussion_r1387567192 ## common/unsafe/src/main/java/org/apache/spark/unsafe/types/VariantVal.java: ## @@ -0,0 +1,124 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.unsafe.types; + +import org.apache.spark.unsafe.Platform; + +import java.io.Serializable; +import java.util.Arrays; + +/** + * The physical data representation of {@link org.apache.spark.sql.types.VariantType} that + * represents a semi-structured value. It consists of two binary values: {@link VariantVal#value} + * and {@link VariantVal#metadata}. The value encodes types and values, but not field names. The + * metadata currently contains a version flag and a list of field names. We can extend/modify the + * detailed binary format given the version flag. + * + * A {@link VariantVal} can be produced by casting another value into the Variant type or parsing a + * JSON string in the {@link org.apache.spark.sql.catalyst.expressions.variant.ParseJson} + * expression. We can extract a path consisting of field names and array indices from it, cast it + * into a concrete data type, or rebuild a JSON string from it. + * + * The storage layout of this class in {@link org.apache.spark.sql.catalyst.expressions.UnsafeRow} + * and {@link org.apache.spark.sql.catalyst.expressions.UnsafeArrayData} is: the fixed-size part is + * a long value "offsetAndSize". The upper 32 bits is the offset that points to the start position + * of the actual binary content. The lower 32 bits is the total length of the binary content. The + * binary content contains: 4 bytes representing the length of {@link VariantVal#value}, content of + * {@link VariantVal#value}, content of {@link VariantVal#metadata}. This is an internal and + * transient format and can be modified at any time. + */ +public class VariantVal implements Serializable { + protected final byte[] value; + protected final byte[] metadata; + + public VariantVal(byte[] value, byte[] metadata) { +this.value = value; +this.metadata = metadata; + } + + public byte[] getValue() { +return value; + } + + public byte[] getMetadata() { +return metadata; + } + + /** + * This function writes the binary content into {@code buffer} starting from {@code cursor}, as + * described in the class comment. The caller should guarantee there is enough space in `buffer`. + */ + public void writeIntoUnsafeRow(byte[] buffer, long cursor) { +Platform.putInt(buffer, cursor, value.length); +Platform.copyMemory(value, Platform.BYTE_ARRAY_OFFSET, buffer, cursor + 4, value.length); +Platform.copyMemory( +metadata, +Platform.BYTE_ARRAY_OFFSET, +buffer, +cursor + 4 + value.length, +metadata.length +); + } + + /** + * This function reads the binary content described in `writeIntoUnsafeRow` from `baseObject`. The + * offset is computed by adding the offset in {@code offsetAndSize} and {@code baseOffset}. + */ + public static VariantVal readFromUnsafeRow(long offsetAndSize, Object baseObject, + long baseOffset) { +// offset and totalSize is the upper/lower 32 bits in offsetAndSize. +int offset = (int) (offsetAndSize >> 32); +int totalSize = (int) offsetAndSize; +int valueSize = Platform.getInt(baseObject, baseOffset + offset); +int metadataSize = totalSize - 4 - valueSize; +byte[] value = new byte[valueSize]; +byte[] metadata = new byte[metadataSize]; +Platform.copyMemory( Review Comment: I know how `UTF8String` works, but I feel it is simpler to have `byte[]` in the `VariantVal` object instead of `baseObject + baseOffset`. I prefer to have this version first and it is not something unchangeable in the future. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [SPARK-45562][SQL] XML: Add SQL error class for missing rowTag option [spark]
sandip-db commented on code in PR #43710: URL: https://github.com/apache/spark/pull/43710#discussion_r1387566546 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlOptions.scala: ## @@ -66,7 +66,11 @@ private[sql] class XmlOptions( val compressionCodec = parameters.get(COMPRESSION).map(CompressionCodecs.getCodecClassName) val rowTagOpt = parameters.get(XmlOptions.ROW_TAG).map(_.trim) - require(!rowTagRequired || rowTagOpt.isDefined, s"'${XmlOptions.ROW_TAG}' option is required.") + + if (rowTagRequired && rowTagOpt.isEmpty) { +throw QueryCompilationErrors.xmlRowTagRequiredError(XmlOptions.ROW_TAG) Review Comment: Currently, `to_xml` doesn't throw an error if `rowTag` is not specified because it uses the default. If we do the change that you suggested above, even `to_xml` will require `rowTag` and throw an error if one is not provided. This PR is just standardizing the error that gets thrown when rowTag is not provided on "read". It doesn't change any other behavior. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45827] Add Variant data type in Spark. [spark]
cloud-fan commented on code in PR #43707: URL: https://github.com/apache/spark/pull/43707#discussion_r1387563169 ## common/unsafe/src/main/java/org/apache/spark/unsafe/types/VariantVal.java: ## @@ -0,0 +1,124 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.unsafe.types; + +import org.apache.spark.unsafe.Platform; + +import java.io.Serializable; +import java.util.Arrays; + +/** + * The physical data representation of {@link org.apache.spark.sql.types.VariantType} that + * represents a semi-structured value. It consists of two binary values: {@link VariantVal#value} + * and {@link VariantVal#metadata}. The value encodes types and values, but not field names. The + * metadata currently contains a version flag and a list of field names. We can extend/modify the + * detailed binary format given the version flag. + * + * A {@link VariantVal} can be produced by casting another value into the Variant type or parsing a + * JSON string in the {@link org.apache.spark.sql.catalyst.expressions.variant.ParseJson} + * expression. We can extract a path consisting of field names and array indices from it, cast it + * into a concrete data type, or rebuild a JSON string from it. + * + * The storage layout of this class in {@link org.apache.spark.sql.catalyst.expressions.UnsafeRow} + * and {@link org.apache.spark.sql.catalyst.expressions.UnsafeArrayData} is: the fixed-size part is + * a long value "offsetAndSize". The upper 32 bits is the offset that points to the start position + * of the actual binary content. The lower 32 bits is the total length of the binary content. The + * binary content contains: 4 bytes representing the length of {@link VariantVal#value}, content of + * {@link VariantVal#value}, content of {@link VariantVal#metadata}. This is an internal and + * transient format and can be modified at any time. + */ +public class VariantVal implements Serializable { + protected final byte[] value; + protected final byte[] metadata; + + public VariantVal(byte[] value, byte[] metadata) { +this.value = value; +this.metadata = metadata; + } + + public byte[] getValue() { +return value; + } + + public byte[] getMetadata() { +return metadata; + } + + /** + * This function writes the binary content into {@code buffer} starting from {@code cursor}, as + * described in the class comment. The caller should guarantee there is enough space in `buffer`. + */ + public void writeIntoUnsafeRow(byte[] buffer, long cursor) { +Platform.putInt(buffer, cursor, value.length); +Platform.copyMemory(value, Platform.BYTE_ARRAY_OFFSET, buffer, cursor + 4, value.length); +Platform.copyMemory( +metadata, +Platform.BYTE_ARRAY_OFFSET, +buffer, +cursor + 4 + value.length, +metadata.length +); + } + + /** + * This function reads the binary content described in `writeIntoUnsafeRow` from `baseObject`. The + * offset is computed by adding the offset in {@code offsetAndSize} and {@code baseOffset}. + */ + public static VariantVal readFromUnsafeRow(long offsetAndSize, Object baseObject, + long baseOffset) { +// offset and totalSize is the upper/lower 32 bits in offsetAndSize. +int offset = (int) (offsetAndSize >> 32); +int totalSize = (int) offsetAndSize; +int valueSize = Platform.getInt(baseObject, baseOffset + offset); +int metadataSize = totalSize - 4 - valueSize; +byte[] value = new byte[valueSize]; +byte[] metadata = new byte[metadataSize]; +Platform.copyMemory( Review Comment: We can avoid copy if `VariantVal` can follow `UTF8String` and also represent its data as `baseObject + baseOffset`. Does Java have nicer way to do it now? cc @rednaxelafx -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe,
Re: [PR] [SPARK-45827] Add Variant data type in Spark. [spark]
chenhao-db commented on code in PR #43707: URL: https://github.com/apache/spark/pull/43707#discussion_r1387562510 ## common/unsafe/src/main/java/org/apache/spark/unsafe/types/VariantVal.java: ## @@ -0,0 +1,124 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.unsafe.types; + +import org.apache.spark.unsafe.Platform; + +import java.io.Serializable; +import java.util.Arrays; + +/** + * The physical data representation of {@link org.apache.spark.sql.types.VariantType} that + * represents a semi-structured value. It consists of two binary values: {@link VariantVal#value} + * and {@link VariantVal#metadata}. The value encodes types and values, but not field names. The + * metadata currently contains a version flag and a list of field names. We can extend/modify the + * detailed binary format given the version flag. + * + * A {@link VariantVal} can be produced by casting another value into the Variant type or parsing a + * JSON string in the {@link org.apache.spark.sql.catalyst.expressions.variant.ParseJson} + * expression. We can extract a path consisting of field names and array indices from it, cast it + * into a concrete data type, or rebuild a JSON string from it. + * + * The storage layout of this class in {@link org.apache.spark.sql.catalyst.expressions.UnsafeRow} + * and {@link org.apache.spark.sql.catalyst.expressions.UnsafeArrayData} is: the fixed-size part is + * a long value "offsetAndSize". The upper 32 bits is the offset that points to the start position + * of the actual binary content. The lower 32 bits is the total length of the binary content. The + * binary content contains: 4 bytes representing the length of {@link VariantVal#value}, content of + * {@link VariantVal#value}, content of {@link VariantVal#metadata}. This is an internal and + * transient format and can be modified at any time. + */ +public class VariantVal implements Serializable { + protected final byte[] value; + protected final byte[] metadata; + + public VariantVal(byte[] value, byte[] metadata) { +this.value = value; +this.metadata = metadata; + } + + public byte[] getValue() { +return value; + } + + public byte[] getMetadata() { +return metadata; + } + + /** + * This function writes the binary content into {@code buffer} starting from {@code cursor}, as + * described in the class comment. The caller should guarantee there is enough space in `buffer`. + */ + public void writeIntoUnsafeRow(byte[] buffer, long cursor) { Review Comment: I'm not quite sure what you mean. This function is called by `UnsafeWriter`, which always uses `byte[]` to build an `UnsafeRow`. There is no benefit of changing this function to use a generic base object. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45825][SQL][CORE][GRAPHX] Fix some scala compile warnings in module sql/catalyst [spark]
LuciferYang commented on PR #43717: URL: https://github.com/apache/spark/pull/43717#issuecomment-1803258795 > > Otherwise, I'm worried that these problems will arise after some time. > > Good question. But I feel the `scalafmt` can't resolve these issue. @LuciferYang Do you know how to add the check rules of IntelliJ to the Spark builder? These are two inspection systems, at least now I don't know how to do it We need some additional investigation -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45562][SQL] XML: Add SQL error class for missing rowTag option [spark]
beliefer commented on code in PR #43710: URL: https://github.com/apache/spark/pull/43710#discussion_r1387559815 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlOptions.scala: ## @@ -66,7 +66,11 @@ private[sql] class XmlOptions( val compressionCodec = parameters.get(COMPRESSION).map(CompressionCodecs.getCodecClassName) val rowTagOpt = parameters.get(XmlOptions.ROW_TAG).map(_.trim) - require(!rowTagRequired || rowTagOpt.isDefined, s"'${XmlOptions.ROW_TAG}' option is required.") + + if (rowTagRequired && rowTagOpt.isEmpty) { +throw QueryCompilationErrors.xmlRowTagRequiredError(XmlOptions.ROW_TAG) Review Comment: So, the error appears if users call `to_xml` and do not specify the rowTag option? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45827] Add Variant data type in Spark. [spark]
cloud-fan commented on code in PR #43707: URL: https://github.com/apache/spark/pull/43707#discussion_r1387559447 ## common/unsafe/src/main/java/org/apache/spark/unsafe/types/VariantVal.java: ## @@ -0,0 +1,124 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.unsafe.types; + +import org.apache.spark.unsafe.Platform; + +import java.io.Serializable; +import java.util.Arrays; + +/** + * The physical data representation of {@link org.apache.spark.sql.types.VariantType} that + * represents a semi-structured value. It consists of two binary values: {@link VariantVal#value} + * and {@link VariantVal#metadata}. The value encodes types and values, but not field names. The + * metadata currently contains a version flag and a list of field names. We can extend/modify the + * detailed binary format given the version flag. + * + * A {@link VariantVal} can be produced by casting another value into the Variant type or parsing a + * JSON string in the {@link org.apache.spark.sql.catalyst.expressions.variant.ParseJson} + * expression. We can extract a path consisting of field names and array indices from it, cast it + * into a concrete data type, or rebuild a JSON string from it. + * + * The storage layout of this class in {@link org.apache.spark.sql.catalyst.expressions.UnsafeRow} + * and {@link org.apache.spark.sql.catalyst.expressions.UnsafeArrayData} is: the fixed-size part is + * a long value "offsetAndSize". The upper 32 bits is the offset that points to the start position + * of the actual binary content. The lower 32 bits is the total length of the binary content. The + * binary content contains: 4 bytes representing the length of {@link VariantVal#value}, content of + * {@link VariantVal#value}, content of {@link VariantVal#metadata}. This is an internal and + * transient format and can be modified at any time. + */ +public class VariantVal implements Serializable { + protected final byte[] value; + protected final byte[] metadata; + + public VariantVal(byte[] value, byte[] metadata) { +this.value = value; +this.metadata = metadata; + } + + public byte[] getValue() { +return value; + } + + public byte[] getMetadata() { +return metadata; + } + + /** + * This function writes the binary content into {@code buffer} starting from {@code cursor}, as + * described in the class comment. The caller should guarantee there is enough space in `buffer`. + */ + public void writeIntoUnsafeRow(byte[] buffer, long cursor) { Review Comment: there is no `byte[]` if Spark is using offheap mode. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45827] Add Variant data type in Spark. [spark]
cloud-fan commented on code in PR #43707: URL: https://github.com/apache/spark/pull/43707#discussion_r1387559229 ## common/unsafe/src/main/java/org/apache/spark/unsafe/types/VariantVal.java: ## @@ -0,0 +1,124 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.unsafe.types; + +import org.apache.spark.unsafe.Platform; + +import java.io.Serializable; +import java.util.Arrays; + +/** + * The physical data representation of {@link org.apache.spark.sql.types.VariantType} that + * represents a semi-structured value. It consists of two binary values: {@link VariantVal#value} + * and {@link VariantVal#metadata}. The value encodes types and values, but not field names. The + * metadata currently contains a version flag and a list of field names. We can extend/modify the + * detailed binary format given the version flag. + * + * A {@link VariantVal} can be produced by casting another value into the Variant type or parsing a + * JSON string in the {@link org.apache.spark.sql.catalyst.expressions.variant.ParseJson} + * expression. We can extract a path consisting of field names and array indices from it, cast it + * into a concrete data type, or rebuild a JSON string from it. + * + * The storage layout of this class in {@link org.apache.spark.sql.catalyst.expressions.UnsafeRow} + * and {@link org.apache.spark.sql.catalyst.expressions.UnsafeArrayData} is: the fixed-size part is + * a long value "offsetAndSize". The upper 32 bits is the offset that points to the start position + * of the actual binary content. The lower 32 bits is the total length of the binary content. The + * binary content contains: 4 bytes representing the length of {@link VariantVal#value}, content of + * {@link VariantVal#value}, content of {@link VariantVal#metadata}. This is an internal and + * transient format and can be modified at any time. + */ +public class VariantVal implements Serializable { + protected final byte[] value; + protected final byte[] metadata; + + public VariantVal(byte[] value, byte[] metadata) { +this.value = value; +this.metadata = metadata; + } + + public byte[] getValue() { +return value; + } + + public byte[] getMetadata() { +return metadata; + } + + /** + * This function writes the binary content into {@code buffer} starting from {@code cursor}, as + * described in the class comment. The caller should guarantee there is enough space in `buffer`. + */ + public void writeIntoUnsafeRow(byte[] buffer, long cursor) { Review Comment: ```suggestion public void writeIntoUnsafeRow(Object baseObject, long baseOffset, long cursor) { ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]
beliefer commented on code in PR #43629: URL: https://github.com/apache/spark/pull/43629#discussion_r1387554850 ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala: ## @@ -101,56 +98,19 @@ case class AnalyzePartitionCommand( if (noscan) { Map.empty } else { -calculateRowCountsPerPartition(sparkSession, tableMeta, partitionValueSpec) +CommandUtils.calculateRowCountsPerPartition(sparkSession, tableMeta, partitionValueSpec) } // Update the metastore if newly computed statistics are different from those // recorded in the metastore. - -val sizes = CommandUtils.calculateMultipleLocationSizes(sparkSession, tableMeta.identifier, - partitions.map(_.storage.locationUri)) -val newPartitions = partitions.zipWithIndex.flatMap { case (p, idx) => - val newRowCount = rowCounts.get(p.spec) - val newStats = CommandUtils.compareAndGetNewStats(p.stats, sizes(idx), newRowCount) - newStats.map(_ => p.copy(stats = newStats)) -} - +val (_, newPartitions) = CommandUtils.calculatePartitionStats( + sparkSession, tableMeta, partitions, Some(rowCounts)) if (newPartitions.nonEmpty) { sessionState.catalog.alterPartitions(tableMeta.identifier, newPartitions) } Seq.empty[Row] } - private def calculateRowCountsPerPartition( Review Comment: Is calculateRowCountsPerPartition used only one place? ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala: ## @@ -101,56 +98,19 @@ case class AnalyzePartitionCommand( if (noscan) { Map.empty } else { -calculateRowCountsPerPartition(sparkSession, tableMeta, partitionValueSpec) +CommandUtils.calculateRowCountsPerPartition(sparkSession, tableMeta, partitionValueSpec) } // Update the metastore if newly computed statistics are different from those // recorded in the metastore. - -val sizes = CommandUtils.calculateMultipleLocationSizes(sparkSession, tableMeta.identifier, - partitions.map(_.storage.locationUri)) -val newPartitions = partitions.zipWithIndex.flatMap { case (p, idx) => - val newRowCount = rowCounts.get(p.spec) - val newStats = CommandUtils.compareAndGetNewStats(p.stats, sizes(idx), newRowCount) - newStats.map(_ => p.copy(stats = newStats)) -} - +val (_, newPartitions) = CommandUtils.calculatePartitionStats( + sparkSession, tableMeta, partitions, Some(rowCounts)) if (newPartitions.nonEmpty) { sessionState.catalog.alterPartitions(tableMeta.identifier, newPartitions) } Seq.empty[Row] } - private def calculateRowCountsPerPartition( Review Comment: Is `calculateRowCountsPerPartition` used only one place? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45825][SQL][CORE][GRAPHX] Fix some scala compile warnings in module sql/catalyst [spark]
beliefer commented on PR #43717: URL: https://github.com/apache/spark/pull/43717#issuecomment-1803250355 > Otherwise, I'm worried that these problems will arise after some time. Good question. But I feel the `scalafmt` can't resolve these issue. @LuciferYang Do you know how to add the check rules of IntelliJ to the Spark builder? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45655][SQL][SS] Allow non-deterministic expressions inside AggregateFunctions in CollectMetrics [spark]
beliefer commented on PR #43517: URL: https://github.com/apache/spark/pull/43517#issuecomment-1803246795 Please check `SPARK-45655: Use current batch timestamp in observe API *** FAILED *** (241 milliseconds)` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45848] Make `spark-version-info.properties` generated by `spark-build-info.ps1` include `docroot` [spark]
LuciferYang commented on PR #43726: URL: https://github.com/apache/spark/pull/43726#issuecomment-1803235338 cc @wangyum FYI Step to check with ps on mac: 1. install ps: https://github.com/powershell/powershell#get-powershell 2. run `export USERNAME=${userName}` 3. run `/usr/local/bin/pwsh -File build/spark-build-info.ps1 ${basedir} ${spark-version}` 4. run `cat ${basedir}/spark-version-info.properties` to check result -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45825][SQL][CORE][GRAPHX] Fix some scala compile warnings in module sql/catalyst [spark]
panbingkun commented on PR #43717: URL: https://github.com/apache/spark/pull/43717#issuecomment-1803227663 Do we have any `tools` that can help prohibit the `next time`? Eg: scalafmt ? Otherwise, I'm worried that these problems will arise after some time. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-45848] Make `spark-version-info.properties` generated by `spark-build-info.ps1` include `docroot` [spark]
LuciferYang opened a new pull request, #43726: URL: https://github.com/apache/spark/pull/43726 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]
sunchao commented on code in PR #43629: URL: https://github.com/apache/spark/pull/43629#discussion_r1387527377 ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala: ## @@ -101,56 +98,19 @@ case class AnalyzePartitionCommand( if (noscan) { Map.empty } else { -calculateRowCountsPerPartition(sparkSession, tableMeta, partitionValueSpec) +CommandUtils.calculateRowCountsPerPartition(sparkSession, tableMeta, partitionValueSpec) } // Update the metastore if newly computed statistics are different from those // recorded in the metastore. - -val sizes = CommandUtils.calculateMultipleLocationSizes(sparkSession, tableMeta.identifier, - partitions.map(_.storage.locationUri)) -val newPartitions = partitions.zipWithIndex.flatMap { case (p, idx) => - val newRowCount = rowCounts.get(p.spec) - val newStats = CommandUtils.compareAndGetNewStats(p.stats, sizes(idx), newRowCount) - newStats.map(_ => p.copy(stats = newStats)) -} - +val (_, newPartitions) = CommandUtils.calculatePartitionStats( + sparkSession, tableMeta, partitions, Some(rowCounts)) if (newPartitions.nonEmpty) { sessionState.catalog.alterPartitions(tableMeta.identifier, newPartitions) } Seq.empty[Row] } - private def calculateRowCountsPerPartition( Review Comment: Hmm what do you mean? it is used multiple times. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44307][SQL] Add Bloom filter for left outer join even if the left side table is smaller than broadcast threshold. [spark]
beliefer commented on PR #41860: URL: https://github.com/apache/spark/pull/41860#issuecomment-1803217192 @maheshk114 Could you rebase this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]
beliefer commented on code in PR #43629: URL: https://github.com/apache/spark/pull/43629#discussion_r1387524303 ## sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala: ## @@ -363,6 +363,83 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto } } + test("SPARK-45731: update partition stats with ANALYZE TABLE") { +val tableName = "analyzeTable_part" + +def queryStats(ds: String): Option[CatalogStatistics] = { + val partition = +spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> ds)) + partition.stats +} + +val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + +Seq(true, false).foreach { partitionStatsEnabled => + withSQLConf(SQLConf.UPDATE_PART_STATS_IN_ANALYZE_TABLE_ENABLED.key -> +partitionStatsEnabled.toString) { +withTable(tableName) { + withTempPath { path => +// Create a table with 3 partitions all located under a directory 'path' +sql( + s""" + |CREATE TABLE $tableName (key STRING, value STRING) + |USING hive + |PARTITIONED BY (ds STRING) + |LOCATION '${path.toURI}' + """.stripMargin) + +partitionDates.foreach { ds => + sql(s"ALTER TABLE $tableName ADD PARTITION (ds='$ds') LOCATION '$path/ds=$ds'") + sql("SELECT * FROM src").write.mode(SaveMode.Overwrite) + .format("parquet").save(s"$path/ds=$ds") +} + +assert(getCatalogTable(tableName).stats.isEmpty) +partitionDates.foreach { ds => + assert(queryStats(ds).isEmpty) +} + +sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS NOSCAN") + +// Table size should also have been updated +assert(getTableStats(tableName).sizeInBytes > 0) +// Row count should NOT be updated with the `NOSCAN` option +assert(getTableStats(tableName).rowCount.isEmpty) + +partitionDates.foreach { ds => + val partStats = queryStats(ds) + if (partitionStatsEnabled) { +assert(partStats.nonEmpty) +assert(partStats.get.sizeInBytes > 0) +assert(partStats.get.rowCount.isEmpty) + } else { +assert(partStats.isEmpty) + } +} + +sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS") + +assert(getTableStats(tableName).sizeInBytes > 0) +// Table row count should be updated +assert(getTableStats(tableName).rowCount.get > 0) Review Comment: I think it doesn't matter for this PR. But I still want to know what the root cause is? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]
beliefer commented on code in PR #43629: URL: https://github.com/apache/spark/pull/43629#discussion_r1387523005 ## sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala: ## @@ -363,6 +363,85 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto } } + test("SPARK-45731: update partition stats with ANALYZE TABLE") { +val tableName = "analyzeTable_part" + +def queryStats(ds: String): Option[CatalogStatistics] = { + val partition = +spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> ds)) + partition.stats +} + +Seq(true, false).foreach { partitionStatsEnabled => + withSQLConf(SQLConf.UPDATE_PART_STATS_IN_ANALYZE_TABLE_ENABLED.key -> + partitionStatsEnabled.toString) { +withTable(tableName) { + withTempPath { path => +// Create a table with 3 partitions all located under a directory 'path' +sql( + s""" + |CREATE TABLE $tableName (key STRING, value STRING) + |USING hive + |PARTITIONED BY (ds STRING) + |LOCATION '${path.toURI}' + """.stripMargin) + +val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + +partitionDates.foreach { ds => + sql(s"ALTER TABLE $tableName ADD PARTITION (ds='$ds') LOCATION '$path/ds=$ds'") + sql("SELECT * FROM src").write.mode(SaveMode.Overwrite) + .format("parquet").save(s"$path/ds=$ds") +} + +assert(getCatalogTable(tableName).stats.isEmpty) +partitionDates.foreach { ds => + assert(queryStats(ds).isEmpty) +} + +sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS NOSCAN") + +val expectedRowCount = 25 + +// Table size should also have been updated +assert(getTableStats(tableName).sizeInBytes > 0) +// Row count should NOT be updated with the `NOSCAN` option +assert(getTableStats(tableName).rowCount.isEmpty) + +partitionDates.foreach { ds => + val partStats = queryStats(ds) + if (partitionStatsEnabled) { +assert(partStats.nonEmpty) +assert(partStats.get.sizeInBytes > 0) +assert(partStats.get.rowCount.isEmpty) + } else { +assert(partStats.isEmpty) + } +} Review Comment: SGTM. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]
beliefer commented on code in PR #43629: URL: https://github.com/apache/spark/pull/43629#discussion_r1387522474 ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala: ## @@ -101,56 +98,19 @@ case class AnalyzePartitionCommand( if (noscan) { Map.empty } else { -calculateRowCountsPerPartition(sparkSession, tableMeta, partitionValueSpec) +CommandUtils.calculateRowCountsPerPartition(sparkSession, tableMeta, partitionValueSpec) } // Update the metastore if newly computed statistics are different from those // recorded in the metastore. - -val sizes = CommandUtils.calculateMultipleLocationSizes(sparkSession, tableMeta.identifier, - partitions.map(_.storage.locationUri)) -val newPartitions = partitions.zipWithIndex.flatMap { case (p, idx) => - val newRowCount = rowCounts.get(p.spec) - val newStats = CommandUtils.compareAndGetNewStats(p.stats, sizes(idx), newRowCount) - newStats.map(_ => p.copy(stats = newStats)) -} - +val (_, newPartitions) = CommandUtils.calculatePartitionStats( + sparkSession, tableMeta, partitions, Some(rowCounts)) if (newPartitions.nonEmpty) { sessionState.catalog.alterPartitions(tableMeta.identifier, newPartitions) } Seq.empty[Row] } - private def calculateRowCountsPerPartition( Review Comment: I think we shouldn't move `calculateRowCountsPerPartition` if it is used only once. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45825][SQL][CORE][GRAPHX] Fix some scala compile warnings in module sql/catalyst [spark]
beliefer commented on code in PR #43717: URL: https://github.com/apache/spark/pull/43717#discussion_r1387462013 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala: ## @@ -197,7 +197,7 @@ case class Mask( */ override def eval(input: InternalRow): Any = { Mask.transformInput( - children(0).eval(input), + children.head.eval(input), Review Comment: Considering the best practices of scala, personally, I feel the `.head` is better than `(0)`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45830][CORE] Refactor `StorageUtils#bufferCleaner` to avoid directly using classes under the `sun` package [spark]
LuciferYang commented on PR #43675: URL: https://github.com/apache/spark/pull/43675#issuecomment-1803202799 convert to draft first, let me check the performance -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45556][UI] Allow web page respond customized status code and message through WebApplicationException [spark]
kuwii commented on PR #43646: URL: https://github.com/apache/spark/pull/43646#issuecomment-1803202085 Kindly ping @HyukjinKwon . Could you please help to take a look at this PR if you think it is OK to do this change? Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45814][CONNECT][SQL]Make ArrowConverters.createEmptyArrowBatch call close() to avoid memory leak [spark]
LuciferYang commented on PR #43691: URL: https://github.com/apache/spark/pull/43691#issuecomment-1803200302 Could re-trigger the failed one? make all tasks green is safer @xieshuaihu -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]
sunchao commented on code in PR #43629: URL: https://github.com/apache/spark/pull/43629#discussion_r1387505911 ## sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala: ## @@ -363,6 +363,83 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto } } + test("SPARK-45731: update partition stats with ANALYZE TABLE") { +val tableName = "analyzeTable_part" + +def queryStats(ds: String): Option[CatalogStatistics] = { + val partition = +spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> ds)) + partition.stats +} + +val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + +Seq(true, false).foreach { partitionStatsEnabled => + withSQLConf(SQLConf.UPDATE_PART_STATS_IN_ANALYZE_TABLE_ENABLED.key -> +partitionStatsEnabled.toString) { +withTable(tableName) { + withTempPath { path => +// Create a table with 3 partitions all located under a directory 'path' +sql( + s""" + |CREATE TABLE $tableName (key STRING, value STRING) + |USING hive + |PARTITIONED BY (ds STRING) + |LOCATION '${path.toURI}' + """.stripMargin) + +partitionDates.foreach { ds => + sql(s"ALTER TABLE $tableName ADD PARTITION (ds='$ds') LOCATION '$path/ds=$ds'") + sql("SELECT * FROM src").write.mode(SaveMode.Overwrite) + .format("parquet").save(s"$path/ds=$ds") +} + +assert(getCatalogTable(tableName).stats.isEmpty) +partitionDates.foreach { ds => + assert(queryStats(ds).isEmpty) +} + +sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS NOSCAN") + +// Table size should also have been updated +assert(getTableStats(tableName).sizeInBytes > 0) +// Row count should NOT be updated with the `NOSCAN` option +assert(getTableStats(tableName).rowCount.isEmpty) + +partitionDates.foreach { ds => + val partStats = queryStats(ds) + if (partitionStatsEnabled) { +assert(partStats.nonEmpty) +assert(partStats.get.sizeInBytes > 0) +assert(partStats.get.rowCount.isEmpty) + } else { +assert(partStats.isEmpty) + } +} + +sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS") + +assert(getTableStats(tableName).sizeInBytes > 0) +// Table row count should be updated +assert(getTableStats(tableName).rowCount.get > 0) Review Comment: For some reason, the expected row count is different between my local run and the run in Spark CI (75 vs 69), so updated to just check if the `rowCount` is > 0 here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45562][SQL] XML: Add SQL error class for missing rowTag option [spark]
sandip-db commented on code in PR #43710: URL: https://github.com/apache/spark/pull/43710#discussion_r1387500399 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlOptions.scala: ## @@ -66,7 +66,11 @@ private[sql] class XmlOptions( val compressionCodec = parameters.get(COMPRESSION).map(CompressionCodecs.getCodecClassName) val rowTagOpt = parameters.get(XmlOptions.ROW_TAG).map(_.trim) - require(!rowTagRequired || rowTagOpt.isDefined, s"'${XmlOptions.ROW_TAG}' option is required.") + + if (rowTagRequired && rowTagOpt.isEmpty) { +throw QueryCompilationErrors.xmlRowTagRequiredError(XmlOptions.ROW_TAG) Review Comment: That would make `rowTag` required for other paths like `to_xml`, where it is ok to use the default. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]
sunchao commented on code in PR #43629: URL: https://github.com/apache/spark/pull/43629#discussion_r1387500112 ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala: ## @@ -101,56 +98,19 @@ case class AnalyzePartitionCommand( if (noscan) { Map.empty } else { -calculateRowCountsPerPartition(sparkSession, tableMeta, partitionValueSpec) +CommandUtils.calculateRowCountsPerPartition(sparkSession, tableMeta, partitionValueSpec) } // Update the metastore if newly computed statistics are different from those // recorded in the metastore. - -val sizes = CommandUtils.calculateMultipleLocationSizes(sparkSession, tableMeta.identifier, - partitions.map(_.storage.locationUri)) -val newPartitions = partitions.zipWithIndex.flatMap { case (p, idx) => - val newRowCount = rowCounts.get(p.spec) - val newStats = CommandUtils.compareAndGetNewStats(p.stats, sizes(idx), newRowCount) - newStats.map(_ => p.copy(stats = newStats)) -} - +val (_, newPartitions) = CommandUtils.calculatePartitionStats( + sparkSession, tableMeta, partitions, Some(rowCounts)) if (newPartitions.nonEmpty) { sessionState.catalog.alterPartitions(tableMeta.identifier, newPartitions) } Seq.empty[Row] } - private def calculateRowCountsPerPartition( Review Comment: It's now used in `CommandUtils` so moving there and switch to use qualified `CommandUtils.calculateRowCountsPerPartition` in this class ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala: ## @@ -86,19 +91,31 @@ object CommandUtils extends Logging { // Calculate table size as a sum of the visible partitions. See SPARK-21079 val partitions = sessionState.catalog.listPartitions(catalogTable.identifier) logInfo(s"Starting to calculate sizes for ${partitions.length} partitions.") - val paths = partitions.map(_.storage.locationUri) - val sizes = calculateMultipleLocationSizes(spark, catalogTable.identifier, paths) - val newPartitions = partitions.zipWithIndex.flatMap { case (p, idx) => -val newStats = CommandUtils.compareAndGetNewStats(p.stats, sizes(idx), None) -newStats.map(_ => p.copy(stats = newStats)) - } + val (sizes, newPartitions) = calculatePartitionStats(spark, catalogTable, partitions, +partitionRowCount) (sizes.sum, newPartitions) Review Comment: Yea we can use `sizes.sum` and save a line here I think ## sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala: ## @@ -363,6 +363,85 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto } } + test("SPARK-45731: update partition stats with ANALYZE TABLE") { +val tableName = "analyzeTable_part" + +def queryStats(ds: String): Option[CatalogStatistics] = { + val partition = +spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> ds)) + partition.stats +} + +Seq(true, false).foreach { partitionStatsEnabled => + withSQLConf(SQLConf.UPDATE_PART_STATS_IN_ANALYZE_TABLE_ENABLED.key -> + partitionStatsEnabled.toString) { +withTable(tableName) { + withTempPath { path => +// Create a table with 3 partitions all located under a directory 'path' +sql( + s""" + |CREATE TABLE $tableName (key STRING, value STRING) + |USING hive + |PARTITIONED BY (ds STRING) + |LOCATION '${path.toURI}' + """.stripMargin) + +val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + +partitionDates.foreach { ds => + sql(s"ALTER TABLE $tableName ADD PARTITION (ds='$ds') LOCATION '$path/ds=$ds'") + sql("SELECT * FROM src").write.mode(SaveMode.Overwrite) + .format("parquet").save(s"$path/ds=$ds") +} + +assert(getCatalogTable(tableName).stats.isEmpty) +partitionDates.foreach { ds => + assert(queryStats(ds).isEmpty) +} + +sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS NOSCAN") + +val expectedRowCount = 25 + +// Table size should also have been updated +assert(getTableStats(tableName).sizeInBytes > 0) +// Row count should NOT be updated with the `NOSCAN` option +assert(getTableStats(tableName).rowCount.isEmpty) + +partitionDates.foreach { ds => + val partStats = queryStats(ds) + if (partitionStatsEnabled) { +assert(partStats.nonEmpty) +assert(partStats.get.sizeInBytes > 0) +assert(partStats.get.rowCount.isEmpty) + } else { +assert(partStats.isEmpty) +
Re: [PR] [SPARK-45796][SQL] Support MODE() WITHIN GROUP (ORDER BY col) [spark]
beliefer commented on PR #43663: URL: https://github.com/apache/spark/pull/43663#issuecomment-1803167968 ping @cloud-fan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45752][SQL] Unreferenced CTE should all be checked by CheckAnalysis0 [spark]
cloud-fan commented on code in PR #43614: URL: https://github.com/apache/spark/pull/43614#discussion_r1387469464 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala: ## @@ -148,15 +148,39 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB errorClass, missingCol, orderedCandidates, a.origin) } + private def checkUnreferencedCTERelations( + cteMap: mutable.Map[Long, (CTERelationDef, Int, mutable.Map[Long, Int])], + visited: mutable.Map[Long, Boolean], + cteId: Long): Unit = { +if (visited(cteId)) { + return +} +val (cteDef, _, refMap) = cteMap(cteId) +refMap.foreach { case (id, _) => + checkUnreferencedCTERelations(cteMap, visited, id) +} +checkAnalysis0(cteDef.child) +visited(cteId) = true + } + def checkAnalysis(plan: LogicalPlan): Unit = { val inlineCTE = InlineCTE(alwaysInline = true) val cteMap = mutable.HashMap.empty[Long, (CTERelationDef, Int, mutable.Map[Long, Int])] inlineCTE.buildCTEMap(plan, cteMap) -cteMap.values.foreach { case (relation, refCount, _) => +cteMap.values.foreach { case (relation, _, _) => // If a CTE relation is never used, it will disappear after inline. Here we explicitly check // analysis for it, to make sure the entire query plan is valid. try { -if (refCount == 0) checkAnalysis0(relation.child) +// If a CTE relation ref count is 0, the other CTE relations that reference it +// should also be checked by checkAnalysis0. This code will also guarantee the leaf +// relations that are not referenced by any others are checked first. Review Comment: ```suggestion // relations that do not reference any others are checked first. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45825][SQL][CORE][GRAPHX] Fix some scala compile warnings in module sql/catalyst [spark]
beliefer commented on code in PR #43717: URL: https://github.com/apache/spark/pull/43717#discussion_r1387462013 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala: ## @@ -197,7 +197,7 @@ case class Mask( */ override def eval(input: InternalRow): Any = { Mask.transformInput( - children(0).eval(input), + children.head.eval(input), Review Comment: Considering the characteristics and best practices of scala, personally, I feel the `.head` is better than `(0)`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45847][SQL][TESTS] CliSuite flakiness due to non-sequential guarantee for stdout [spark]
yaooqinn commented on PR #43725: URL: https://github.com/apache/spark/pull/43725#issuecomment-1803131732 cc @cloud-fan @dongjoon-hyun @LuciferYang -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45832][SQL] Fix `Super method + is deprecated.` [spark]
beliefer commented on PR #43713: URL: https://github.com/apache/spark/pull/43713#issuecomment-1803122782 I checked the inheritance hierarchy too. I agree @LuciferYang 's suggestion, fix it later. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-45847][SQL][TESTS] CliSuite flakiness due to non-sequential guarantee for stdout [spark]
yaooqinn opened a new pull request, #43725: URL: https://github.com/apache/spark/pull/43725 ### What changes were proposed in this pull request? In CliSuite, This PR adds a retry for tests that write errors to STDERR. ### Why are the changes needed? To fix flakiness tests as below https://github.com/chenhao-db/apache-spark/actions/runs/6791437199/job/18463313766 https://github.com/dongjoon-hyun/spark/actions/runs/6753670527/job/18361206900 ```sql [info] Spark master: local, Application Id: local-1699402393189 [info] spark-sql> /* SELECT /*+ HINT() 4; */; [info] [info] [PARSE_SYNTAX_ERROR] Syntax error at or near ';'. SQLSTATE: 42601 (line 1, pos 26) [info] [info] == SQL == [info] /* SELECT /*+ HINT() 4; */; [info] --^^^ [info] [info] spark-sql> /* SELECT /*+ HINT() 4; */ SELECT 1; [info] 1 [info] Time taken: 1.499 seconds, Fetched 1 row(s) [info] [info] [UNCLOSED_BRACKETED_COMMENT] Found an unclosed bracketed comment. Please, append */ at the end of the comment. SQLSTATE: 42601 [info] == SQL == [info] /* Here is a unclosed bracketed comment SELECT 1; [info] spark-sql> /* Here is a unclosed bracketed comment SELECT 1; [info] spark-sql> /* SELECT /*+ HINT() */ 4; */; [info] spark-sql> ``` As you can see the fragment above, the query on the 3rd line from the bottom, came from STDOUT, was printed later than its error output, came from STDERR. In this scenario, the error output would not match anything and would simply go unnoticed. Finally, timed out and failed. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests and CI ### Was this patch authored or co-authored using generative AI tooling? no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45825][SQL][CORE][GRAPHX] Fix some scala compile warnings in module sql/catalyst [spark]
beliefer commented on PR #43717: URL: https://github.com/apache/spark/pull/43717#issuecomment-1803114073 > Are other modules to be completed in followup or will continue to update this one? @beliefer I created an umbrella issue, please see https://issues.apache.org/jira/browse/SPARK-45823 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45830][CORE] Refactor `StorageUtils#bufferCleaner` to avoid directly using classes under the `sun` package [spark]
LuciferYang commented on PR #43675: URL: https://github.com/apache/spark/pull/43675#issuecomment-1803110236 > Do you think there is a chance of performance difference, @LuciferYang ? `cleanMethod.invoke(cleanerMethod.invoke(buffer))` vs `cleanerMethod.invoke(unsafe, buffer)`? Although the official recommendation is for the former rather than reflection, I have previously done some mircabenchmarks on Java 8 and there is no significant performance difference between the two. If necessary, I can write some more cases to compare them in Java 17. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45825][SQL][CORE][GRAPHX] Fix some scala compile warnings in module sql/catalyst [spark]
beliefer commented on PR #43717: URL: https://github.com/apache/spark/pull/43717#issuecomment-1803109724 cc @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order [spark]
mridulm commented on PR #43627: URL: https://github.com/apache/spark/pull/43627#issuecomment-1803110054 > On the SPIP, I did send it to dev as a DISCUSS thread on November 4. I would like to get input on that as well. You are right, I do [see it here](https://lists.apache.org/thread/m5pv7vzg3109ts706b13t1c9p5zys2rq) - looks like I missed it :-( -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45825][SQL][CORE][GRAPHX] Fix some scala compile warnings in module sql/catalyst [spark]
beliefer commented on code in PR #43717: URL: https://github.com/apache/spark/pull/43717#discussion_r1387445107 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala: ## @@ -197,7 +197,7 @@ case class Mask( */ override def eval(input: InternalRow): Any = { Mask.transformInput( - children(0).eval(input), + children.head.eval(input), Review Comment: You means the old one is better? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] fix typo & remove some unused code [spark]
panbingkun opened a new pull request, #43724: URL: https://github.com/apache/spark/pull/43724 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45831][CORE][SQL][DSTREAM] Use collection factory instead to create immutable Java collections [spark]
dongjoon-hyun commented on PR #43709: URL: https://github.com/apache/spark/pull/43709#issuecomment-1803103862 Merged to master~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45831][CORE][SQL][DSTREAM] Use collection factory instead to create immutable Java collections [spark]
dongjoon-hyun closed pull request #43709: [SPARK-45831][CORE][SQL][DSTREAM] Use collection factory instead to create immutable Java collections URL: https://github.com/apache/spark/pull/43709 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45835][INFRA] Make gitHub labeler more accurate and remove outdated comments [spark]
dongjoon-hyun commented on PR #43716: URL: https://github.com/apache/spark/pull/43716#issuecomment-1803103118 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45835][INFRA] Make gitHub labeler more accurate and remove outdated comments [spark]
dongjoon-hyun closed pull request #43716: [SPARK-45835][INFRA] Make gitHub labeler more accurate and remove outdated comments URL: https://github.com/apache/spark/pull/43716 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45831][CORE][SQL][DSTREAM] Use collection factory instead to create immutable Java collections [spark]
LuciferYang commented on PR #43709: URL: https://github.com/apache/spark/pull/43709#issuecomment-1803103014 Thanks @dongjoon-hyun ~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45825][SQL] Fix some scala compile warnings in module sql/catalyst [spark]
LuciferYang commented on PR #43717: URL: https://github.com/apache/spark/pull/43717#issuecomment-1803101246 Are other modules to be completed in followup or will continue to update this one? @beliefer -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45825][SQL] Fix some scala compile warnings in module sql/catalyst [spark]
dongjoon-hyun commented on code in PR #43717: URL: https://github.com/apache/spark/pull/43717#discussion_r1387438376 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/numberFormatExpressions.scala: ## @@ -242,7 +242,7 @@ object ToCharacterBuilder extends ExpressionBuilder { override def build(funcName: String, expressions: Seq[Expression]): Expression = { val numArgs = expressions.length if (numArgs == 2) { - val (inputExpr, format) = (expressions(0), expressions(1)) + val (inputExpr, format) = (expressions.head, expressions(1)) Review Comment: I'm not sure about this part. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45825][SQL] Fix some scala compile warnings in module sql/catalyst [spark]
dongjoon-hyun commented on code in PR #43717: URL: https://github.com/apache/spark/pull/43717#discussion_r1387438134 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala: ## @@ -197,7 +197,7 @@ case class Mask( */ override def eval(input: InternalRow): Any = { Mask.transformInput( - children(0).eval(input), + children.head.eval(input), Review Comment: In this case, the old one is more consistent to the next lines. ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala: ## @@ -237,7 +237,7 @@ case class Mask( ctx: CodegenContext, ev: ExprCode, f: (String, String, String, String, String) => String): ExprCode = { -val firstGen = children(0).genCode(ctx) +val firstGen = children.head.genCode(ctx) Review Comment: ditto. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45832][SQL] Fix `Super method + is deprecated.` [spark]
LuciferYang commented on PR #43713: URL: https://github.com/apache/spark/pull/43713#issuecomment-1803099217 hmm... if scalac doesn't report this warning, my personal suggestion is to leave it unfixed. The inheritance hierarchy is as follows: `AttributeMap` <- `immutable.Map` <- `immutable.MapOps` <- `collection.MapOps` Although the `+` of `collection.MapOps` is marked as deprecated, the `+` of `immutable.MapOps` does not appear to be deprecated. ```scala /** * Alias for `updated` * * @param kv the key/value pair. * @tparam V1 the type of the value in the key/value pair. * @return A new map with the new binding added to this map. */ override def + [V1 >: V](kv: (K, V1)): CC[K, V1] = updated(kv._1, kv._2) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]
beliefer commented on code in PR #43711: URL: https://github.com/apache/spark/pull/43711#discussion_r1387436911 ## sql/core/src/test/resources/sql-tests/results/group-by.sql.out: ## @@ -338,7 +338,7 @@ SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) -- !query schema struct -- !query output -1.01.0 3 +0. 0. 3 Review Comment: You can reference the output of other mainstream databases. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45825][SQL] Fix some scala compile warnings in module sql/catalyst [spark]
beliefer commented on PR #43717: URL: https://github.com/apache/spark/pull/43717#issuecomment-1803097528 ping @dongjoon-hyun @LuciferYang cc @panbingkun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]
liujiayi771 commented on code in PR #43711: URL: https://github.com/apache/spark/pull/43711#discussion_r1387434023 ## sql/core/src/test/resources/sql-tests/results/group-by.sql.out: ## @@ -338,7 +338,7 @@ SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) -- !query schema struct -- !query output -1.01.0 3 +0. 0. 3 Review Comment: The result is not incorrect, it is just a precision issue with double. For example, ```java 2 / Math.sqrt(2 * 2) = 1.0 2 / Math.sqrt(2) / Math.sqrt(2) = 0. ``` From the user's perspective, 1.0 is more user-friendly. I am currently unsure about whether to sacrifice user-friendliness in order to support an extreme case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]
liujiayi771 commented on code in PR #43711: URL: https://github.com/apache/spark/pull/43711#discussion_r1387434023 ## sql/core/src/test/resources/sql-tests/results/group-by.sql.out: ## @@ -338,7 +338,7 @@ SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) -- !query schema struct -- !query output -1.01.0 3 +0. 0. 3 Review Comment: The result is not incorrect, it is just a precision issue with double. For example, ```java 2 / Math.sqrt(2 * 2) = 1.0 2 / Math.sqrt(2) * Math.sqrt(2) = 0. ``` From the user's perspective, 1.0 is more user-friendly. I am currently unsure about whether to sacrifice user-friendliness in order to support an extreme case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45832][SQL] Fix `Super method + is deprecated.` [spark]
beliefer commented on PR #43713: URL: https://github.com/apache/spark/pull/43713#issuecomment-1803089169 Yes. this warning appeared in `IntelliJ`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45831][CORE][SQL][DSTREAM] Use collection factory instead to create immutable Java collections [spark]
LuciferYang commented on PR #43709: URL: https://github.com/apache/spark/pull/43709#issuecomment-1803085434 > There are also some places where `Collections.unmodifiable` is used, such as > > https://github.com/apache/spark/blob/bfcd2c47742d69632584f6bb3cf08a8bc3ca8c8c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L85-L102 > > But I don't think the new API can make them look simpler, so I didn't change them in this pr @dongjoon-hyun There are three other similar places here, and I think both `Collections.unmodifiableMap` and `Map.copyOf` can be used, and `Map.copyOf` will trigger a collection copy, so no changes were made in this pr. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-42746][SQL] Add the LISTAGG() aggregate function [spark]
hopefulnick commented on code in PR #42398: URL: https://github.com/apache/spark/pull/42398#discussion_r1387425067 ## sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala: ## @@ -158,6 +158,69 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark } } + test("SPARK-42746: listagg function") { +withTempView("df", "df2") { + Seq(("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), (null, null)).toDF("a", "b") +.createOrReplaceTempView("df") + checkAnswer( +sql("select listagg(b) from df group by a"), +Row("") :: Row("b,c") :: Row("c,d") :: Nil) + + checkAnswer( +sql("select listagg(b) from df where 1 != 1"), +Row("") :: Nil) + + checkAnswer( +sql("select listagg(b, '|') from df group by a"), +Row("b|c") :: Row("c|d") :: Row("") :: Nil) + + checkAnswer( +sql("SELECT LISTAGG(a) FROM df"), +Row("a,a,b,b") :: Nil) + + checkAnswer( +sql("SELECT LISTAGG(DISTINCT a) FROM df"), +Row("a,b") :: Nil) + + checkAnswer( +sql("SELECT LISTAGG(a) WITHIN GROUP (ORDER BY a) FROM df"), +Row("a,a,b,b") :: Nil) + + checkAnswer( +sql("SELECT LISTAGG(a) WITHIN GROUP (ORDER BY a DESC) FROM df"), +Row("b,b,a,a") :: Nil) + + checkAnswer( +sql("SELECT LISTAGG(a) WITHIN GROUP (ORDER BY a DESC) " + + "OVER (PARTITION BY b) FROM df"), +Row("a") :: Row("b,a") :: Row("b,a") :: Row("b") :: Row("") :: Nil) + + checkAnswer( +sql("SELECT LISTAGG(a) WITHIN GROUP (ORDER BY b) FROM df"), +Row("a,a,b,b") :: Nil) + + checkAnswer( +sql("SELECT LISTAGG(a) WITHIN GROUP (ORDER BY b DESC) FROM df"), Review Comment: when specifying the custom seperator, like ',', it will get "b','a','b,','a", not the expected result "b,a,b,a" -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]
beliefer commented on code in PR #43711: URL: https://github.com/apache/spark/pull/43711#discussion_r1387423207 ## sql/core/src/test/resources/sql-tests/results/group-by.sql.out: ## @@ -338,7 +338,7 @@ SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) -- !query schema struct -- !query output -1.01.0 3 +0. 0. 3 Review Comment: I guess `0.` is incorrect. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45562][SQL] XML: Add SQL error class for missing rowTag option [spark]
beliefer commented on code in PR #43710: URL: https://github.com/apache/spark/pull/43710#discussion_r1387422030 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlOptions.scala: ## @@ -66,7 +66,11 @@ private[sql] class XmlOptions( val compressionCodec = parameters.get(COMPRESSION).map(CompressionCodecs.getCodecClassName) val rowTagOpt = parameters.get(XmlOptions.ROW_TAG).map(_.trim) - require(!rowTagRequired || rowTagOpt.isDefined, s"'${XmlOptions.ROW_TAG}' option is required.") + + if (rowTagRequired && rowTagOpt.isEmpty) { +throw QueryCompilationErrors.xmlRowTagRequiredError(XmlOptions.ROW_TAG) Review Comment: if so, we should change `val rowTag = rowTagOpt.getOrElse(XmlOptions.DEFAULT_ROW_TAG)` to `val rowTag = rowTagOpt.get` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45832][SQL] Fix `Super method + is deprecated.` [spark]
panbingkun commented on PR #43713: URL: https://github.com/apache/spark/pull/43713#issuecomment-1803076916 A strange and interesting phenomenon is that: when I add `"-Wconf:cat=deprecation=Super method:e"` to `SparkBuild`, then run `./build/sbt "catalyst/testOnly"`, no error was triggered, but I did see `deprecated prompt` in `IntelliJ`. Is it a `false alarm`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45826][SQL] Add a SQL config for extra traces in `Origin` [spark]
beliefer commented on code in PR #43695: URL: https://github.com/apache/spark/pull/43695#discussion_r1387412665 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -4531,6 +4531,14 @@ object SQLConf { .booleanConf .createWithDefault(false) + val EXTRA_ORIGIN_TRACES = buildConf("spark.sql.extraOriginTraces") +.doc("The number of additional non-Spark SQL traces in the captured DataFrame context. " + + "When it is set to 0, captured one Spark traces and a followed non-Spark trace.") +.version("4.0.0") +.intConf +.checkValue(_ >= 0, "The number of extra thread traces must be non-negative.") Review Comment: I'm a bit confused. Intuitively, I feel that 0 should represent the absence of non-Spark trace. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45831][CORE][SQL][DSTREAM] Use collection factory instead to create immutable Java collections [spark]
dongjoon-hyun commented on code in PR #43709: URL: https://github.com/apache/spark/pull/43709#discussion_r1387410114 ## common/utils/src/main/java/org/apache/spark/network/util/JavaUtils.java: ## @@ -202,29 +202,27 @@ private static boolean isSymlink(File file) throws IOException { private static final Map byteSuffixes; static { -final Map timeSuffixesBuilder = new HashMap<>(); -timeSuffixesBuilder.put("us", TimeUnit.MICROSECONDS); -timeSuffixesBuilder.put("ms", TimeUnit.MILLISECONDS); -timeSuffixesBuilder.put("s", TimeUnit.SECONDS); -timeSuffixesBuilder.put("m", TimeUnit.MINUTES); -timeSuffixesBuilder.put("min", TimeUnit.MINUTES); -timeSuffixesBuilder.put("h", TimeUnit.HOURS); -timeSuffixesBuilder.put("d", TimeUnit.DAYS); -timeSuffixes = Collections.unmodifiableMap(timeSuffixesBuilder); - -final Map byteSuffixesBuilder = new HashMap<>(); -byteSuffixesBuilder.put("b", ByteUnit.BYTE); -byteSuffixesBuilder.put("k", ByteUnit.KiB); -byteSuffixesBuilder.put("kb", ByteUnit.KiB); -byteSuffixesBuilder.put("m", ByteUnit.MiB); -byteSuffixesBuilder.put("mb", ByteUnit.MiB); -byteSuffixesBuilder.put("g", ByteUnit.GiB); -byteSuffixesBuilder.put("gb", ByteUnit.GiB); -byteSuffixesBuilder.put("t", ByteUnit.TiB); -byteSuffixesBuilder.put("tb", ByteUnit.TiB); -byteSuffixesBuilder.put("p", ByteUnit.PiB); -byteSuffixesBuilder.put("pb", ByteUnit.PiB); -byteSuffixes = Collections.unmodifiableMap(byteSuffixesBuilder); + timeSuffixes = Map.of( Review Comment: Indentation? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45842][SQL] Refactor Catalog Function APIs to use analyzer [spark]
dongjoon-hyun closed pull request #43720: [SPARK-45842][SQL] Refactor Catalog Function APIs to use analyzer URL: https://github.com/apache/spark/pull/43720 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45686][INFRA][CORE][SQL][SS][CONNECT][MLLIB][DSTREAM][AVRO][ML][K8S][YARN][PYTHON][R][UI][GRAPHX][PROTOBUF][TESTS][EXAMPLES] Explicitly convert `Array` to `Seq` when function input is
LuciferYang commented on PR #43670: URL: https://github.com/apache/spark/pull/43670#issuecomment-1803060096 rebased -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45832][SQL] Fix `Super method + is deprecated.` [spark]
beliefer commented on code in PR #43713: URL: https://github.com/apache/spark/pull/43713#discussion_r1387405692 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLateralColumnAliasReference.scala: ## @@ -170,7 +170,8 @@ object ResolveLateralColumnAliasReference extends Rule[LogicalPlan] { case (a: Alias, idx) => val lcaResolved = unwrapLCAReference(a) // Insert the original alias instead of rewritten one to detect chained LCA -aliasMap += (a.toAttribute -> AliasEntry(a, idx)) +aliasMap = + AttributeMap(aliasMap.baseMap.values.toMap + (a.toAttribute -> AliasEntry(a, idx))) Review Comment: Sounds good. But it seems created extra AttributeMap. How about `aliasMap = AttributeMap(aliasMap.updated(a.toAttribute, AliasEntry(a, idx)))`? ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLateralColumnAliasReference.scala: ## @@ -170,7 +170,8 @@ object ResolveLateralColumnAliasReference extends Rule[LogicalPlan] { case (a: Alias, idx) => val lcaResolved = unwrapLCAReference(a) // Insert the original alias instead of rewritten one to detect chained LCA -aliasMap += (a.toAttribute -> AliasEntry(a, idx)) +aliasMap = + AttributeMap(aliasMap.baseMap.values.toMap + (a.toAttribute -> AliasEntry(a, idx))) Review Comment: Sounds good. But it seems created extra `AttributeMap`. How about `aliasMap = AttributeMap(aliasMap.updated(a.toAttribute, AliasEntry(a, idx)))`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45686][INFRA][CORE][SQL][SS][CONNECT][MLLIB][DSTREAM][AVRO][ML][K8S][YARN][PYTHON][R][UI][GRAPHX][PROTOBUF][TESTS][EXAMPLES] Explicitly convert `Array` to `Seq` when function input is
LuciferYang commented on PR #43670: URL: https://github.com/apache/spark/pull/43670#issuecomment-1803057104 > So this should help performance in 2.13 by avoiding defensive copies, and instead wrapping as 2.12 did? Yes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45686][INFRA][CORE][SQL][SS][CONNECT][MLLIB][DSTREAM][AVRO][ML][K8S][YARN][PYTHON][R][UI][GRAPHX][PROTOBUF][TESTS][EXAMPLES] Explicitly convert `Array` to `Seq` when function input is
LuciferYang commented on PR #43670: URL: https://github.com/apache/spark/pull/43670#issuecomment-1803056696 > UserDefinedPythonDataSource Yes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45834][SQL] Fix Pearson correlation calculation more stable [spark]
liujiayi771 commented on PR #43711: URL: https://github.com/apache/spark/pull/43711#issuecomment-1803054998 The potential side effect of this modification is that it is easier to obtain a finite number for `sqrt(xMk * yMk)`, while `sqrt(xMk) * sqrt(yMk)` can easily result in an infinite number, for example, ```java Math.sqrt(2 * 2) = 2.0 Math.sqrt(2) * Math.sqrt(2) = 2.0004 ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]
beliefer commented on code in PR #43629: URL: https://github.com/apache/spark/pull/43629#discussion_r1387395341 ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala: ## @@ -86,19 +91,31 @@ object CommandUtils extends Logging { // Calculate table size as a sum of the visible partitions. See SPARK-21079 val partitions = sessionState.catalog.listPartitions(catalogTable.identifier) logInfo(s"Starting to calculate sizes for ${partitions.length} partitions.") - val paths = partitions.map(_.storage.locationUri) - val sizes = calculateMultipleLocationSizes(spark, catalogTable.identifier, paths) - val newPartitions = partitions.zipWithIndex.flatMap { case (p, idx) => -val newStats = CommandUtils.compareAndGetNewStats(p.stats, sizes(idx), None) -newStats.map(_ => p.copy(stats = newStats)) - } + val (sizes, newPartitions) = calculatePartitionStats(spark, catalogTable, partitions, +partitionRowCount) (sizes.sum, newPartitions) } logInfo(s"It took ${(System.nanoTime() - startTime) / (1000 * 1000)} ms to calculate" + s" the total size for table ${catalogTable.identifier}.") (totalSize, newPartitions) } + def calculatePartitionStats( + spark: SparkSession, + catalogTable: CatalogTable, + partitions: Seq[CatalogTablePartition], + partitionRowCount: Option[Map[TablePartitionSpec, BigInt]] = None): + (Seq[Long], Seq[CatalogTablePartition]) = { +val paths = partitions.map(_.storage.locationUri) +val sizes = calculateMultipleLocationSizes(spark, catalogTable.identifier, paths) +val newPartitions = partitions.zipWithIndex.flatMap { case (p, idx) => + val newRowCount = partitionRowCount.flatMap(_.get(p.spec)) + val newStats = CommandUtils.compareAndGetNewStats(p.stats, sizes(idx), newRowCount) + newStats.map(_ => p.copy(stats = newStats)) +} +(sizes, newPartitions) Review Comment: ```suggestion (sizes.sum, newPartitions) ``` ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala: ## @@ -101,56 +98,19 @@ case class AnalyzePartitionCommand( if (noscan) { Map.empty } else { -calculateRowCountsPerPartition(sparkSession, tableMeta, partitionValueSpec) +CommandUtils.calculateRowCountsPerPartition(sparkSession, tableMeta, partitionValueSpec) } // Update the metastore if newly computed statistics are different from those // recorded in the metastore. - -val sizes = CommandUtils.calculateMultipleLocationSizes(sparkSession, tableMeta.identifier, - partitions.map(_.storage.locationUri)) -val newPartitions = partitions.zipWithIndex.flatMap { case (p, idx) => - val newRowCount = rowCounts.get(p.spec) - val newStats = CommandUtils.compareAndGetNewStats(p.stats, sizes(idx), newRowCount) - newStats.map(_ => p.copy(stats = newStats)) -} - +val (_, newPartitions) = CommandUtils.calculatePartitionStats( + sparkSession, tableMeta, partitions, Some(rowCounts)) if (newPartitions.nonEmpty) { sessionState.catalog.alterPartitions(tableMeta.identifier, newPartitions) } Seq.empty[Row] } - private def calculateRowCountsPerPartition( Review Comment: Is `calculateRowCountsPerPartition` shared with other caller? ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala: ## @@ -86,19 +91,31 @@ object CommandUtils extends Logging { // Calculate table size as a sum of the visible partitions. See SPARK-21079 val partitions = sessionState.catalog.listPartitions(catalogTable.identifier) logInfo(s"Starting to calculate sizes for ${partitions.length} partitions.") - val paths = partitions.map(_.storage.locationUri) - val sizes = calculateMultipleLocationSizes(spark, catalogTable.identifier, paths) - val newPartitions = partitions.zipWithIndex.flatMap { case (p, idx) => -val newStats = CommandUtils.compareAndGetNewStats(p.stats, sizes(idx), None) -newStats.map(_ => p.copy(stats = newStats)) - } + val (sizes, newPartitions) = calculatePartitionStats(spark, catalogTable, partitions, +partitionRowCount) (sizes.sum, newPartitions) Review Comment: We can delete this line if `calculatePartitionStats` returns `sizes.sum`. ## sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala: ## @@ -363,6 +363,85 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto } } + test("SPARK-45731: update partition stats with ANALYZE TABLE") { +val tableName = "analyzeTable_part" + +def queryStats(ds: String): Option[CatalogStatistics] = { + val partition = +spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" ->
Re: [PR] [SPARK-45832][SQL] Fix `Super method + is deprecated.` [spark]
panbingkun commented on code in PR #43713: URL: https://github.com/apache/spark/pull/43713#discussion_r1387392114 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLateralColumnAliasReference.scala: ## @@ -170,7 +170,8 @@ object ResolveLateralColumnAliasReference extends Rule[LogicalPlan] { case (a: Alias, idx) => val lcaResolved = unwrapLCAReference(a) // Insert the original alias instead of rewritten one to detect chained LCA -aliasMap += (a.toAttribute -> AliasEntry(a, idx)) +aliasMap = + AttributeMap(aliasMap.baseMap.values.toMap + (a.toAttribute -> AliasEntry(a, idx))) Review Comment: Maybe so that we can hide access to internal values as much as possible. ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLateralColumnAliasReference.scala: ## @@ -170,7 +170,8 @@ object ResolveLateralColumnAliasReference extends Rule[LogicalPlan] { case (a: Alias, idx) => val lcaResolved = unwrapLCAReference(a) // Insert the original alias instead of rewritten one to detect chained LCA -aliasMap += (a.toAttribute -> AliasEntry(a, idx)) +aliasMap = + AttributeMap(aliasMap.baseMap.values.toMap + (a.toAttribute -> AliasEntry(a, idx))) Review Comment: Maybe so that we can hide access to `internal values` as much as possible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45832][SQL] Fix `Super method + is deprecated.` [spark]
panbingkun commented on code in PR #43713: URL: https://github.com/apache/spark/pull/43713#discussion_r1387391412 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveLateralColumnAliasReference.scala: ## @@ -170,7 +170,8 @@ object ResolveLateralColumnAliasReference extends Rule[LogicalPlan] { case (a: Alias, idx) => val lcaResolved = unwrapLCAReference(a) // Insert the original alias instead of rewritten one to detect chained LCA -aliasMap += (a.toAttribute -> AliasEntry(a, idx)) +aliasMap = + AttributeMap(aliasMap.baseMap.values.toMap + (a.toAttribute -> AliasEntry(a, idx))) Review Comment: ```suggestion aliasMap ++= AttributeMap(Map(a.toAttribute -> AliasEntry(a, idx))) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Add support for java 17 from spark 3.5.0 [spark-docker]
Yikun commented on PR #56: URL: https://github.com/apache/spark-docker/pull/56#issuecomment-1803034637 Thanks for your efforts @vakarisbk , I'm going to merge this PR later today or tomorrow. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order [spark]
abellina commented on PR #43627: URL: https://github.com/apache/spark/pull/43627#issuecomment-1803024221 @mridulm thanks for the comments. I think they mostly work and I’m testing it at the moment. I’ll push something either tonight or tomorrow am. On the SPIP, I did send it to dev as a DISCUSS thread on November 4. I would like to get input on that as well. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45829][DOCS] Update the default value for spark.executor.logs.rolling.maxSize [spark]
chenyu-opensource commented on PR #43712: URL: https://github.com/apache/spark/pull/43712#issuecomment-1803002444 > Thanks. Merged to master 3.5 3.4 3.3 Thank you so much for your review -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45814][CONNECT][SQL]Make ArrowConverters.createEmptyArrowBatch call close() to avoid memory leak [spark]
xieshuaihu commented on PR #43691: URL: https://github.com/apache/spark/pull/43691#issuecomment-1803001325 It seems like that this test failure is unrelated to this pr -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]
sunchao commented on code in PR #43629: URL: https://github.com/apache/spark/pull/43629#discussion_r1387358004 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2671,6 +2671,16 @@ object SQLConf { .booleanConf .createWithDefault(false) + val ANALYZE_PARTITION_STATS_ENABLED = +buildConf("spark.sql.statistics.update.partitionStats.enabled") Review Comment: Thanks, updated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]
sunchao commented on code in PR #43629: URL: https://github.com/apache/spark/pull/43629#discussion_r1387343974 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2671,6 +2671,16 @@ object SQLConf { .booleanConf .createWithDefault(false) + val ANALYZE_PARTITION_STATS_ENABLED = +buildConf("spark.sql.statistics.update.partitionStats.enabled") Review Comment: Hmm, how about `spark.sql.statistics.updatePartitionStatsInAnalyzeTable.enabled`? it is a bit long but can fully express the intention -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]
dongjoon-hyun commented on code in PR #43629: URL: https://github.com/apache/spark/pull/43629#discussion_r1387354165 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2671,6 +2671,16 @@ object SQLConf { .booleanConf .createWithDefault(false) + val ANALYZE_PARTITION_STATS_ENABLED = +buildConf("spark.sql.statistics.update.partitionStats.enabled") Review Comment: +1 for the new name. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]
sunchao commented on code in PR #43629: URL: https://github.com/apache/spark/pull/43629#discussion_r1387343974 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2671,6 +2671,16 @@ object SQLConf { .booleanConf .createWithDefault(false) + val ANALYZE_PARTITION_STATS_ENABLED = +buildConf("spark.sql.statistics.update.partitionStats.enabled") Review Comment: Hmm, how about `spark.sql.statistics.updatePartitionStatsInAnalyzeTable.enabled`? it is a bit long but can fully expression the intention ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2671,6 +2671,16 @@ object SQLConf { .booleanConf .createWithDefault(false) + val ANALYZE_PARTITION_STATS_ENABLED = +buildConf("spark.sql.statistics.update.partitionStats.enabled") Review Comment: Hmm, how about `spark.sql.statistics.updatePartitionStatsInAnalyzeTable.enabled`? it is a bit long but can fully expression the intention ## sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala: ## @@ -372,54 +372,70 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto partition.stats } -withTable(tableName) { - withTempPath { path => -// Create a table with 3 partitions all located under a single top-level directory 'path' -sql( - s""" - |CREATE TABLE $tableName (key STRING, value STRING) - |USING hive - |PARTITIONED BY (ds STRING) - |LOCATION '${path.toURI}' - """.stripMargin) +Seq(true, false).foreach { partitionStatsEnabled => + withSQLConf(SQLConf.ANALYZE_PARTITION_STATS_ENABLED.key -> partitionStatsEnabled.toString) { +withTable(tableName) { + withTempPath { path => +// Create a table with 3 partitions all located under a directory 'path' +sql( + s""" + |CREATE TABLE $tableName (key STRING, value STRING) + |USING hive + |PARTITIONED BY (ds STRING) + |LOCATION '${path.toURI}' + """.stripMargin) -val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") +val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") -partitionDates.foreach { ds => - sql(s"ALTER TABLE $tableName ADD PARTITION (ds='$ds') LOCATION '$path/ds=$ds'") - sql("SELECT * FROM src").write.mode(SaveMode.Overwrite) - .format("parquet").save(s"$path/ds=$ds") -} +partitionDates.foreach { ds => + sql(s"ALTER TABLE $tableName ADD PARTITION (ds='$ds') LOCATION '$path/ds=$ds'") + sql("SELECT * from src").write.mode(SaveMode.Overwrite) Review Comment: oops will revert this change -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]
dongjoon-hyun commented on code in PR #43629: URL: https://github.com/apache/spark/pull/43629#discussion_r1387340831 ## sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala: ## @@ -372,54 +372,70 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto partition.stats } -withTable(tableName) { - withTempPath { path => -// Create a table with 3 partitions all located under a single top-level directory 'path' -sql( - s""" - |CREATE TABLE $tableName (key STRING, value STRING) - |USING hive - |PARTITIONED BY (ds STRING) - |LOCATION '${path.toURI}' - """.stripMargin) +Seq(true, false).foreach { partitionStatsEnabled => + withSQLConf(SQLConf.ANALYZE_PARTITION_STATS_ENABLED.key -> partitionStatsEnabled.toString) { +withTable(tableName) { + withTempPath { path => +// Create a table with 3 partitions all located under a directory 'path' +sql( + s""" + |CREATE TABLE $tableName (key STRING, value STRING) + |USING hive + |PARTITIONED BY (ds STRING) + |LOCATION '${path.toURI}' + """.stripMargin) -val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") +val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") -partitionDates.foreach { ds => - sql(s"ALTER TABLE $tableName ADD PARTITION (ds='$ds') LOCATION '$path/ds=$ds'") - sql("SELECT * FROM src").write.mode(SaveMode.Overwrite) - .format("parquet").save(s"$path/ds=$ds") -} +partitionDates.foreach { ds => + sql(s"ALTER TABLE $tableName ADD PARTITION (ds='$ds') LOCATION '$path/ds=$ds'") + sql("SELECT * from src").write.mode(SaveMode.Overwrite) Review Comment: Let's keep the original. It was `FROM` instead of `from`.. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]
dongjoon-hyun commented on code in PR #43629: URL: https://github.com/apache/spark/pull/43629#discussion_r1387340391 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2671,6 +2671,16 @@ object SQLConf { .booleanConf .createWithDefault(false) + val ANALYZE_PARTITION_STATS_ENABLED = +buildConf("spark.sql.statistics.update.partitionStats.enabled") Review Comment: 1. Do we have other configurations under `spark.sql.statistics.update.*`? We should not add a new namespace for a single configuration. Shall we rename like the following? ``` - spark.sql.statistics.update.partitionStats.enabled + spark.sql.statistics.updatePartitionStats.enabled ``` 2. Also, the current config name looks too general. Can we revise the config name to give some idea about `ANALYZE TABLE` syntax? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45731][SQL] Also update partition statistics with `ANALYZE TABLE` command [spark]
sunchao commented on PR #43629: URL: https://github.com/apache/spark/pull/43629#issuecomment-1802927767 Thanks all! Updated the PR with a new flag. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45845][SS][UI] Add number of evicted state rows to streaming UI [spark]
WweiL commented on PR #43723: URL: https://github.com/apache/spark/pull/43723#issuecomment-1802783939 cc @HeartSaVioR can you take a look please : ) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-45845][SS][UI] Add number of evicted state rows to streaming UI [spark]
WweiL opened a new pull request, #43723: URL: https://github.com/apache/spark/pull/43723 ### What changes were proposed in this pull request? In a stateful query, a watermark has two responsibilities: 1. drop late rows 2. Evict state rows from state store Before we only log "aggregated number of rows dropped by watermark". This is case 1. But people would confuse this with case 2. This PR purpose we also add a chart for case 2. Also made the explanation for case 1 more verbose. Now: https://github.com/apache/spark/assets/10248890/b1f7f94f-94fe-490b-b2d9-abbb7170b67b;> ### Why are the changes needed? UI improvement ### Does this PR introduce _any_ user-facing change? Yes, users will be able to view the new UI ### How was this patch tested? Manually tested ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45752][SQL] Unreferenced CTE should all be checked by CheckAnalysis0 [spark]
amaliujia commented on code in PR #43614: URL: https://github.com/apache/spark/pull/43614#discussion_r1387214224 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala: ## @@ -156,7 +171,15 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB // If a CTE relation is never used, it will disappear after inline. Here we explicitly check // analysis for it, to make sure the entire query plan is valid. try { -if (refCount == 0) checkAnalysis0(relation.child) Review Comment: Ok I added some comments to explain we are trying to check the leaf relation first because that gives the most accurate error message. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45827] Add Variant data type in Spark. [spark]
chenhao-db commented on code in PR #43707: URL: https://github.com/apache/spark/pull/43707#discussion_r1387206814 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/PhysicalDataType.scala: ## @@ -326,6 +327,17 @@ case class PhysicalStructType(fields: Array[StructField]) extends PhysicalDataTy } } +class PhysicalVariantType extends PhysicalDataType { Review Comment: I don't feel there is anything special and worth a comment in this class. All `Physical*Type` classes in this file don't have a comment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45843][CORE] Support `killall` in REST Submission API [spark]
dongjoon-hyun closed pull request #43721: [SPARK-45843][CORE] Support `killall` in REST Submission API URL: https://github.com/apache/spark/pull/43721 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org