[GitHub] [spark] HyukjinKwon closed pull request #34687: [SPARK-36231][PYTHON] Support arithmetic operations of decimal(nan) series
HyukjinKwon closed pull request #34687: URL: https://github.com/apache/spark/pull/34687
[GitHub] [spark] HyukjinKwon commented on pull request #34687: [SPARK-36231][PYTHON] Support arithmetic operations of decimal(nan) series
HyukjinKwon commented on pull request #34687: URL: https://github.com/apache/spark/pull/34687#issuecomment-976348748 Merged to master.
[GitHub] [spark] SparkQA commented on pull request #34677: [SPARK-37436][PYTHON] Uses Python's standard string formatter for SQL API in pandas API on Spark
SparkQA commented on pull request #34677: URL: https://github.com/apache/spark/pull/34677#issuecomment-976432192 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50019/
[GitHub] [spark] peter-toth opened a new pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
peter-toth opened a new pull request #34693: URL: https://github.com/apache/spark/pull/34693

### What changes were proposed in this pull request?
CTE queries are not supported with MSSQL server via JDBC because MSSQL server doesn't support statements with nested CTEs. When Spark builds the final query that it will send to the server via JDBC, it wraps the original query (`options.tableOrQuery`) in parentheses in `JDBCRDD.resolveTable()` and `JDBCRDD.compute()`. Unfortunately, it is non-trivial to split an arbitrary query into "with" and "regular" query clauses in `MsSqlServerDialect`. So instead, I'm proposing a new general JDBC option, "withClause", that users can use if they have complex queries with CTEs.

### Why are the changes needed?
To support CTE queries with MSSQL.

### Does this PR introduce _any_ user-facing change?
Yes, CTE queries are now supported.

### How was this patch tested?
Added new integration UTs.
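For illustration, a minimal sketch of the proposed option in use, assuming the option keeps the name `withClause` described above; `jdbcUrl`, table, and column names are placeholders:

```
// Sketch only: the CTE definition is passed separately via "withClause"
// so that only the regular query part gets wrapped in parentheses when
// Spark sends the final statement to the MSSQL server.
val df = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("withClause", "WITH cte AS (SELECT id, name FROM tbl)")
  .option("query", "SELECT id FROM cte WHERE id > 10")
  .load()
```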
[GitHub] [spark] SparkQA commented on pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
SparkQA commented on pull request #34693: URL: https://github.com/apache/spark/pull/34693#issuecomment-976489807 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50020/
[GitHub] [spark] AmplabJenkins commented on pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans
AmplabJenkins commented on pull request #34611: URL: https://github.com/apache/spark/pull/34611#issuecomment-976490264 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145540/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34677: [SPARK-37436][PYTHON] Uses Python's standard string formatter for SQL API in pandas API on Spark
AmplabJenkins removed a comment on pull request #34677: URL: https://github.com/apache/spark/pull/34677#issuecomment-976494224 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145547/
[GitHub] [spark] SparkQA removed a comment on pull request #34691: [SPARK-37447][SQL] Cache LogicalPlan.isStreaming() result in a lazy val
SparkQA removed a comment on pull request #34691: URL: https://github.com/apache/spark/pull/34691#issuecomment-976296694 **[Test build #145544 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145544/testReport)** for PR 34691 at commit [`0c393a5`](https://github.com/apache/spark/commit/0c393a586b2a172c0138eeacf7552d8561157af4).
[GitHub] [spark] cloud-fan commented on a change in pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
cloud-fan commented on a change in pull request #34668: URL: https://github.com/apache/spark/pull/34668#discussion_r755141437

## File path: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 ##

```
@@ -73,6 +78,19 @@ grammar SqlBase;
       return false;
     }
   }
+
+  /**
+   * This method will be called when the character stream ends and try to find out the
+   * unclosed bracketed comment.
+   * If the next character is -1, it means the end of the entire character stream match,
+   * and we throw exception to prevent executing the query.
+   */
+  public void end() {
```

Review comment: nit: `checkUnclosedComment`
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575: URL: https://github.com/apache/spark/pull/34575#discussion_r755165986

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ##

```
@@ -355,7 +377,14 @@ case class FileSourceScanExec(
   @transient private lazy val pushedDownFilters = {
     val supportNestedPredicatePushdown = DataSourceUtils.supportNestedPredicatePushdown(relation)
-    dataFilters.flatMap(DataSourceStrategy.translateFilter(_, supportNestedPredicatePushdown))
+    dataFilters
+      .filterNot(
+        _.references.exists {
+          case MetadataAttribute(_) => true
```

Review comment: can we leave a TODO comment here?
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575: URL: https://github.com/apache/spark/pull/34575#discussion_r755170678

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ##

```
@@ -103,6 +116,135 @@ class FileScanRDD(
     context.killTaskIfInterrupted()
     (currentIterator != null && currentIterator.hasNext) || nextIterator()
   }
+
+  ///
+  // FILE METADATA METHODS //
+  ///
+
+  // whether a metadata column exists and it is a `MetadataAttribute`
+  private lazy val hasMetadataAttribute: Boolean = {
```

Review comment: do we need this `lazy val`? I think this is simply `metadataStructCol.isDefined`, as the caller side should always pass in the metadata col.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
AmplabJenkins removed a comment on pull request #34668: URL: https://github.com/apache/spark/pull/34668#issuecomment-97983 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50021/
[GitHub] [spark] cloud-fan commented on a change in pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty
cloud-fan commented on a change in pull request #34504: URL: https://github.com/apache/spark/pull/34504#discussion_r755214307

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ##

```
@@ -1548,6 +1548,31 @@ object PushPredicateThroughNonJoin extends Rule[LogicalPlan] with PredicateHelper
       filter
     }
+    case filter @ Filter(condition, w: Window) if w.partitionSpec.isEmpty =>
+      w.windowExpressions match {
+        case Seq(alias @ Alias(WindowExpression(_: RowNumber, WindowSpecDefinition(Nil, orderSpec,
+            SpecifiedWindowFrame(RowFrame, UnboundedPreceding, CurrentRow))), _)) =>
+          val aliasAttr = alias.toAttribute
+          val limitValue = splitConjunctivePredicates(condition).collectFirst {
+            case LessThanOrEqual(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v
+            case Equality(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v
+            case LessThan(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v - 1
+          }
+
+          limitValue match {
+            case Some(lv) if lv <= 0 =>
+              LocalRelation(filter.output, data = Seq.empty, isStreaming = filter.isStreaming)
+            case Some(lv)
+                if lv < conf.topKSortFallbackThreshold && w.child.maxRows.forall(_ > lv) =>
+              filter.copy(child =
+                w.copy(child = Limit(Literal(lv), Sort(orderSpec, true, w.child
```

Review comment: BTW, is `orderSpec` always specified?
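For orientation, a sketch of the query shape this rule targets; names are illustrative and `df` is an assumed input DataFrame with a `ts` column:

```
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Top-k pattern matched by the rule: row_number() over a window with no
// PARTITION BY (so partitionSpec.isEmpty), filtered by a small constant.
// The rewrite effectively places Limit(10) + Sort below the Window, which
// the planner can run as TakeOrderedAndProject instead of a global sort.
val w = Window.orderBy(col("ts"))   // no partitionBy, so partitionSpec is empty
val topK = df.withColumn("rn", row_number().over(w)).filter(col("rn") <= 10)
```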
[GitHub] [spark] cloud-fan closed pull request #34454: [SPARK-37013][CORE][SQL][FOLLOWUP] Use the new error framework to throw error in `FormatString`
cloud-fan closed pull request #34454: URL: https://github.com/apache/spark/pull/34454
[GitHub] [spark] cloud-fan commented on pull request #34454: [SPARK-37013][CORE][SQL][FOLLOWUP] Use the new error framework to throw error in `FormatString`
cloud-fan commented on pull request #34454: URL: https://github.com/apache/spark/pull/34454#issuecomment-976686019 Sorry I missed the ping. Merging to master, thanks!
[GitHub] [spark] srowen commented on a change in pull request #34679: [SPARK-37437][BUILD] Remove unused hive profile
srowen commented on a change in pull request #34679: URL: https://github.com/apache/spark/pull/34679#discussion_r755238638

## File path: pom.xml ##

```
@@ -3353,11 +3353,6 @@
-
```

Review comment: We could possibly leave the profile in and have it still do nothing, because if the profile doesn't exist, a build with -Phive-2.3 fails, even though it's really "OK" - it's already Hive 2.3 by default. I don't feel strongly about it.
[GitHub] [spark] cloud-fan commented on a change in pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
cloud-fan commented on a change in pull request #34668: URL: https://github.com/apache/spark/pull/34668#discussion_r755142758

## File path: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 ##

```
@@ -73,6 +78,19 @@ grammar SqlBase;
       return false;
     }
   }
+
+  /**
+   * This method will be called when the character stream ends and try to find out the
+   * unclosed bracketed comment.
+   * If the next character is -1, it means the end of the entire character stream match,
```

Review comment: BTW, do we need to check? Looking at `('*/' | {end();} EOF)`, it seems it's always EOF if we enter this method.
[GitHub] [spark] cloud-fan commented on a change in pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
cloud-fan commented on a change in pull request #34668: URL: https://github.com/apache/spark/pull/34668#discussion_r755141872

## File path: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 ##

```
@@ -73,6 +78,19 @@ grammar SqlBase;
       return false;
     }
   }
+
+  /**
+   * This method will be called when the character stream ends and try to find out the
+   * unclosed bracketed comment.
+   * If the next character is -1, it means the end of the entire character stream match,
+   * and we throw exception to prevent executing the query.
```

Review comment: We don't throw an exception here; we set the flag and fail later.
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575: URL: https://github.com/apache/spark/pull/34575#discussion_r755178439

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ##

```
@@ -103,6 +116,135 @@ class FileScanRDD(
     context.killTaskIfInterrupted()
     (currentIterator != null && currentIterator.hasNext) || nextIterator()
   }
+
+  ///
+  // FILE METADATA METHODS //
+  ///
+
+  // whether a metadata column exists and it is a `MetadataAttribute`
+  private lazy val hasMetadataAttribute: Boolean = {
+    metadataStruct.exists {
+      case MetadataAttribute(_) => true
+      case _ => false
+    }
+  }
+
+  // metadata struct unsafe row, will only be updated when the current file is changed
+  @volatile private var metadataStructUnsafeRow: UnsafeRow = _
+  // metadata generic row, will only be updated when the current file is changed
+  @volatile private var metadataStructGenericRow: Row = _
+  // an unsafe joiner to join an unsafe row with the metadata unsafe row
+  lazy private val unsafeRowJoiner =
+    if (hasMetadataAttribute)
+      GenerateUnsafeRowJoiner.create(requiredSchema, Seq(metadataStruct.get).toStructType)
+
+  // Create a off/on heap WritableColumnVector
+  private def createColumnVector(numRows: Int, dataType: DataType): WritableColumnVector = {
+    if (offHeapColumnVectorEnabled) {
+      new OffHeapColumnVector(numRows, dataType)
+    } else {
+      new OnHeapColumnVector(numRows, dataType)
+    }
+  }
+
+  /**
+   * For each partitioned file, metadata columns for each record in the file are exactly same.
+   * Only update metadata columns when `currentFile` is changed.
+   */
+  private def updateMetadataStruct(): Unit = {
+    if (hasMetadataAttribute) {
+      val meta = metadataStruct.get
+      if (currentFile == null) {
+        metadataStructUnsafeRow = new UnsafeRow(1)
+        metadataStructGenericRow = new GenericRow(1)
```

Review comment: seem we can just null out these 2 rows. If `currentFile` is null, we won't output any more records, and these 2 rows are useless.
[GitHub] [spark] cloud-fan commented on pull request #34642: [SPARK-37369][SQL] Avoid redundant ColumnarToRow transition on InMemoryTableScan
cloud-fan commented on pull request #34642: URL: https://github.com/apache/spark/pull/34642#issuecomment-976690695 I'm trying to understand the motivation. Is it because in-memory table can output rows efficiently? Parquet scan can also output rows but we try our best to output columnar batches.
[GitHub] [spark] srowen commented on pull request #34692: [SPARK-11792][FOLLOWUP] Update scaladoc of KnownSizeEstimation
srowen commented on pull request #34692: URL: https://github.com/apache/spark/pull/34692#issuecomment-976711598 Because that JIRA is so old, we wouldn't really treat this as part of that JIRA. I'll just make it a minor docs PR.
[GitHub] [spark] gengliangwang commented on a change in pull request #34596: [SPARK-37326][SQL] Support TimestampNTZ in CSV data source
gengliangwang commented on a change in pull request #34596: URL: https://github.com/apache/spark/pull/34596#discussion_r755249223

## File path: sql/core/src/test/resources/sql-tests/results/timestampNTZ/timestamp.sql.out ##

```
@@ -373,17 +374,19 @@ struct
```

Review comment: (screenshot: https://user-images.githubusercontent.com/1097932/143054054-2cc408d6-0894-4584-9034-4561fe82ab6c.png) I looked it up in the ANSI SQL spec. The string input here can be considered as a timestamp with TZ. I think we should allow casting a timestamp string with TZ to TimestampNTZ.
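For illustration, the cast under discussion, sketched via the Scala API; this reflects the proposed behavior, not necessarily the merged semantics, and assumes the `TIMESTAMP_NTZ` type name available on master at the time:

```
// Proposal under discussion: a timestamp string that carries a zone
// offset would be accepted when cast to TIMESTAMP_NTZ, with the zone
// part handled per the ANSI rules cited above.
spark.sql("SELECT CAST('2021-11-23 10:00:00+08:00' AS TIMESTAMP_NTZ)").show()
```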
[GitHub] [spark] peter-toth commented on pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
peter-toth commented on pull request #34693: URL: https://github.com/apache/spark/pull/34693#issuecomment-976828400 Hmm, failures in `ExpressionsSchemaSuite` look unrelated...
[GitHub] [spark] SparkQA commented on pull request #34691: [SPARK-37447][SQL] Cache LogicalPlan.isStreaming() result in a lazy val
SparkQA commented on pull request #34691: URL: https://github.com/apache/spark/pull/34691#issuecomment-976549263 **[Test build #145544 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145544/testReport)** for PR 34691 at commit [`0c393a5`](https://github.com/apache/spark/commit/0c393a586b2a172c0138eeacf7552d8561157af4).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #34070: [SPARK-36840][SQL] Support DPP if there is no selective predicate on the filtering side
AmplabJenkins commented on pull request #34070: URL: https://github.com/apache/spark/pull/34070#issuecomment-976570564 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145545/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34070: [SPARK-36840][SQL] Support DPP if there is no selective predicate on the filtering side
AmplabJenkins removed a comment on pull request #34070: URL: https://github.com/apache/spark/pull/34070#issuecomment-976570564 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145545/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
AmplabJenkins removed a comment on pull request #34693: URL: https://github.com/apache/spark/pull/34693#issuecomment-976570563 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50020/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34691: [SPARK-37447][SQL] Cache LogicalPlan.isStreaming() result in a lazy val
AmplabJenkins removed a comment on pull request #34691: URL: https://github.com/apache/spark/pull/34691#issuecomment-976570562 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145544/
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575: URL: https://github.com/apache/spark/pull/34575#discussion_r755163320

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ##

```
@@ -194,10 +195,22 @@ case class FileSourceScanExec(
     disableBucketedScan: Boolean = false)
   extends DataSourceScanExec {
 
+  lazy val outputMetadataStruct: Option[AttributeReference] =
+    output.collectFirst { case MetadataAttribute(attr) => attr }
+
   // Note that some vals referring the file-based relation are lazy intentionally
   // so that this plan can be canonicalized on executor side too. See SPARK-23731.
   override lazy val supportsColumnar: Boolean = {
-    relation.fileFormat.supportBatch(relation.sparkSession, schema)
+    // schema without the file metadata column
+    val fileSchema = if (outputMetadataStruct.isEmpty) schema else {
+      StructType.fromAttributes(
```

Review comment: nit: `output.filter(_.exprId != metadataStructCol.get.exprId).toStruct`
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575: URL: https://github.com/apache/spark/pull/34575#discussion_r755167684

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ##

```
@@ -171,6 +171,28 @@ trait FileFormat {
   def supportFieldName(name: String): Boolean = true
 }
+
+object FileFormat {
+
+  val FILE_PATH = "file_path"
+
+  val FILE_NAME = "file_name"
+
+  val FILE_SIZE = "file_size"
+
+  val FILE_MODIFICATION_TIME = "file_modification_time"
+
+  val METADATA_NAME = "_metadata"
+
+  val METADATA_STRUCT: DataType = new StructType()
+    .add(StructField(FILE_PATH, StringType))
+    .add(StructField(FILE_NAME, StringType))
+    .add(StructField(FILE_SIZE, LongType))
+    .add(StructField(FILE_MODIFICATION_TIME, LongType))
+
+  // supported metadata columns for hadoop fs relation
+  def FILE_METADATA_COLUMNS: AttributeReference = MetadataAttribute(METADATA_NAME, METADATA_STRUCT)
```

Review comment: since it's a `def` now, how about `def createFileMetadataCol`
[GitHub] [spark] SparkQA removed a comment on pull request #34070: [SPARK-36840][SQL] Support DPP if there is no selective predicate on the filtering side
SparkQA removed a comment on pull request #34070: URL: https://github.com/apache/spark/pull/34070#issuecomment-976297392 **[Test build #145545 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145545/testReport)** for PR 34070 at commit [`50484e2`](https://github.com/apache/spark/commit/50484e23c714e06057993757efa20034a261d2bf).
[GitHub] [spark] SparkQA commented on pull request #34070: [SPARK-36840][SQL] Support DPP if there is no selective predicate on the filtering side
SparkQA commented on pull request #34070: URL: https://github.com/apache/spark/pull/34070#issuecomment-976551994 **[Test build #145545 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145545/testReport)** for PR 34070 at commit [`50484e2`](https://github.com/apache/spark/commit/50484e23c714e06057993757efa20034a261d2bf).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] cloud-fan commented on a change in pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
cloud-fan commented on a change in pull request #34668: URL: https://github.com/apache/spark/pull/34668#discussion_r755150850

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala ##

```
@@ -78,9 +79,30 @@ abstract class AbstractSqlParser extends ParserInterface with SQLConfHelper with
     astBuilder.visitQuery(parser.query())
   }
 
+  /** check `has_unclosed_bracketed_comment` to find out the unclosed bracketed comment. */
+  private def singleStatementWithCheck(
+      parser: SqlBaseParser, sqlText: String): SingleStatementContext = {
+    val singleStatementContext = parser.singleStatement()
+    assert(parser.getTokenStream.isInstanceOf[CommonTokenStream])
+
+    val tokenStream = parser.getTokenStream.asInstanceOf[CommonTokenStream]
+    assert(tokenStream.getTokenSource.isInstanceOf[SqlBaseLexer])
+
+    val lexer = tokenStream.getTokenSource.asInstanceOf[SqlBaseLexer]
+    if (lexer.has_unclosed_bracketed_comment) {
+      // The last token is 'EOF' and the penultimate is unclosed bracketed comment
+      val failedToken = tokenStream.get(tokenStream.size() - 2)
+      assert(failedToken.getType() == SqlBaseParser.BRACKETED_COMMENT)
+      val position = Origin(Option(failedToken.getLine), Option(failedToken.getCharPositionInLine))
+      throw QueryParsingErrors.unclosedBracketedCommentError(sqlText, position)
+    }
+
+    singleStatementContext
+  }
+
   /** Creates LogicalPlan for a given SQL string. */
   override def parsePlan(sqlText: String): LogicalPlan = parse(sqlText) { parser =>
-    astBuilder.visitSingleStatement(parser.singleStatement()) match {
+    astBuilder.visitSingleStatement(singleStatementWithCheck(parser, sqlText)) match {
```

Review comment: Can we check the unclosed comment using a listener in the `parse` method below?
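For orientation, a sketch of the input this check guards against; the exact failure comes from `QueryParsingErrors.unclosedBracketedCommentError` quoted above:

```
// Without the lexer flag, an unclosed bracketed comment silently swallows
// the rest of the statement; with this change, parsing fails instead.
spark.sql("SELECT 1 /* this bracketed comment is never closed")
// expected: a parse error pointing at the unclosed comment
```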
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575: URL: https://github.com/apache/spark/pull/34575#discussion_r755164613

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ##

```
@@ -212,7 +225,16 @@ case class FileSourceScanExec(
     relation.fileFormat.vectorTypes(
       requiredSchema = requiredSchema,
       partitionSchema = relation.partitionSchema,
-      relation.sparkSession.sessionState.conf)
+      relation.sparkSession.sessionState.conf).map { vectorTypes =>
+        val metadataVectorClz =
+          if (relation.sparkSession.sessionState.conf.offHeapColumnVectorEnabled) {
+            classOf[OffHeapColumnVector].getName
+          } else {
+            classOf[OnHeapColumnVector].getName
```

Review comment: since we will change to use a constant vector soon, how about we always use `OnHeapColumnVector` for now, to simplify the code?
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575: URL: https://github.com/apache/spark/pull/34575#discussion_r755183339

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ##

```
@@ -103,6 +116,135 @@ class FileScanRDD(
     context.killTaskIfInterrupted()
     (currentIterator != null && currentIterator.hasNext) || nextIterator()
   }
+
+  ///
+  // FILE METADATA METHODS //
+  ///
+
+  // whether a metadata column exists and it is a `MetadataAttribute`
+  private lazy val hasMetadataAttribute: Boolean = {
+    metadataStruct.exists {
+      case MetadataAttribute(_) => true
+      case _ => false
+    }
+  }
+
+  // metadata struct unsafe row, will only be updated when the current file is changed
+  @volatile private var metadataStructUnsafeRow: UnsafeRow = _
+  // metadata generic row, will only be updated when the current file is changed
+  @volatile private var metadataStructGenericRow: Row = _
+  // an unsafe joiner to join an unsafe row with the metadata unsafe row
+  lazy private val unsafeRowJoiner =
+    if (hasMetadataAttribute)
+      GenerateUnsafeRowJoiner.create(requiredSchema, Seq(metadataStruct.get).toStructType)
+
+  // Create a off/on heap WritableColumnVector
+  private def createColumnVector(numRows: Int, dataType: DataType): WritableColumnVector = {
+    if (offHeapColumnVectorEnabled) {
+      new OffHeapColumnVector(numRows, dataType)
+    } else {
+      new OnHeapColumnVector(numRows, dataType)
+    }
+  }
+
+  /**
+   * For each partitioned file, metadata columns for each record in the file are exactly same.
+   * Only update metadata columns when `currentFile` is changed.
+   */
+  private def updateMetadataStruct(): Unit = {
+    if (hasMetadataAttribute) {
+      val meta = metadataStruct.get
+      if (currentFile == null) {
+        metadataStructUnsafeRow = new UnsafeRow(1)
+        metadataStructGenericRow = new GenericRow(1)
+      } else {
+        // make an generic row
+        assert(meta.dataType.isInstanceOf[StructType])
+        metadataStructGenericRow = Row.fromSeq(
```

Review comment: And does `metadataStructGenericRow` need to be a class member? Seems can be a local variable.
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575: URL: https://github.com/apache/spark/pull/34575#discussion_r755188315

## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataColumnsSuite.scala ##

```
@@ -0,0 +1,481 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.io.File
+import java.nio.file.Files
+import java.text.SimpleDateFormat
+
+import org.apache.spark.sql.{AnalysisException, Column, DataFrame, QueryTest, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.test.SharedSparkSession
+import org.apache.spark.sql.types.{IntegerType, LongType, StringType, StructField, StructType}
+
+class FileMetadataColumnsSuite extends QueryTest with SharedSparkSession {
+
+  val data0: String =
+    """
+      |jack,24,12345,uom
+      |""".stripMargin
+
+  val data1: String =
+    """
+      |lily,31,,ucb
+      |""".stripMargin
+
+  val schema: StructType = new StructType()
+    .add(StructField("name", StringType))
+    .add(StructField("age", IntegerType))
+    .add(StructField("id", LongType))
+    .add(StructField("university", StringType))
+
+  val schemaWithNameConflicts: StructType = new StructType()
+    .add(StructField("name", StringType))
+    .add(StructField("age", IntegerType))
+    .add(StructField("_metadata.file_size", LongType))
+    .add(StructField("_metadata.FILE_NAME", StringType))
+
+  private val METADATA_FILE_PATH = "_metadata.file_path"
+
+  private val METADATA_FILE_NAME = "_metadata.file_name"
+
+  private val METADATA_FILE_SIZE = "_metadata.file_size"
+
+  private val METADATA_FILE_MODIFICATION_TIME = "_metadata.file_modification_time"
+
+  /**
+   * Create a CSV file named `fileName` with `data` under `dir` directory.
+   */
+  private def createCSVFile(data: String, dir: File, fileName: String): String = {
+    val dataFile = new File(dir, s"/$fileName")
+    dataFile.getParentFile.mkdirs()
+    val bytes = data.getBytes()
+    Files.write(dataFile.toPath, bytes)
+    dataFile.getCanonicalPath
+  }
+
+  /**
+   * This test wrapper will test for both row-based and column-based file formats (csv and parquet)
+   * 1. read data0 and data1 and write them as testFileFormat: f0 and f1
+   * 2. read both f0 and f1, return the df to the downstream for further testing
+   * 3. construct actual metadata map for both f0 and f1 to the downstream for further testing
+   *
+   * The final df will have data:
+   * jack | 24 | 12345 | uom
+   * lily | 31 | null  | ucb
+   *
+   * The schema of the df will be the `fileSchema` provided to this method
+   *
+   * This test wrapper will provide a `df` and actual metadata map `f0`, `f1`
+   */
+  private def metadataColumnsTest(
+      testName: String, fileSchema: StructType)
+      (f: (DataFrame, Map[String, Any], Map[String, Any]) => Unit): Unit = {
+    Seq("csv", "parquet").foreach { testFileFormat =>
+      test(s"metadata columns ($testFileFormat): " + testName) {
+        withTempDir { dir =>
+          // read data0 as CSV and write f0 as testFileFormat
+          val df0 = spark.read.schema(fileSchema).csv(
+            createCSVFile(data0, dir, "temp/0")
+          )
+          val f0Path = new File(dir, "data/f0").getCanonicalPath
+          df0.coalesce(1).write.format(testFileFormat).save(f0Path)
+
+          // read data1 as CSV and write f1 as testFileFormat
+          val df1 = spark.read.schema(fileSchema).csv(
+            createCSVFile(data1, dir, "temp/1")
+          )
+          val f1Path = new File(dir, "data/f1").getCanonicalPath
+          df1.coalesce(1).write.format(testFileFormat).save(f1Path)
+
+          // read both f0 and f1
+          val df = spark.read.format(testFileFormat).schema(fileSchema)
+            .load(new File(dir, "data").getCanonicalPath + "/*")
+
+          val realF0 = new File(dir, "data/f0").listFiles()
+            .filter(_.getName.endsWith(s".$testFileFormat")).head
+
+          val realF1 = new File(dir, "data/f1").listFiles()
+            .filter(_.getName.endsWith(s".$testFileFormat")).head
+
+          // construct f0 and f1 metadata data
+          val f0Metadata = Map(
```
[GitHub] [spark] cloud-fan commented on a change in pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty
cloud-fan commented on a change in pull request #34504: URL: https://github.com/apache/spark/pull/34504#discussion_r755213627

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ##

```
@@ -1548,6 +1548,31 @@ object PushPredicateThroughNonJoin extends Rule[LogicalPlan] with PredicateHelper
       filter
     }
+    case filter @ Filter(condition, w: Window) if w.partitionSpec.isEmpty =>
+      w.windowExpressions match {
+        case Seq(alias @ Alias(WindowExpression(_: RowNumber, WindowSpecDefinition(Nil, orderSpec,
+            SpecifiedWindowFrame(RowFrame, UnboundedPreceding, CurrentRow))), _)) =>
+          val aliasAttr = alias.toAttribute
+          val limitValue = splitConjunctivePredicates(condition).collectFirst {
+            case LessThanOrEqual(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v
+            case Equality(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v
+            case LessThan(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v - 1
+          }
+
+          limitValue match {
+            case Some(lv) if lv <= 0 =>
+              LocalRelation(filter.output, data = Seq.empty, isStreaming = filter.isStreaming)
+            case Some(lv)
+                if lv < conf.topKSortFallbackThreshold && w.child.maxRows.forall(_ > lv) =>
+              filter.copy(child =
+                w.copy(child = Limit(Literal(lv), Sort(orderSpec, true, w.child
```

Review comment: One worry is what if other optimizer rules put/move operators between Limit and Sort? Then we can't use `TakeOrderedAndProjectExec`, and this global sort introduces a big overhead.
[GitHub] [spark] peter-toth commented on pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
peter-toth commented on pull request #34693: URL: https://github.com/apache/spark/pull/34693#issuecomment-97672 This change also seems to work with MSSQL's temp table syntax:

```
val withClause = "(SELECT * INTO #TempTable FROM (SELECT * FROM tbl) t)"
val query = "SELECT * FROM #TempTable"
val df = spark.read.format("jdbc")
  .option("url", jdbcUrl)
  .option("withClause", withClause)
  .option("query", query)
  .load()
```
[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-976793578 **[Test build #145550 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145550/testReport)** for PR 34671 at commit [`61fe9fb`](https://github.com/apache/spark/commit/61fe9fb6bc09863aa74ab82ac5d653e2b922ae0a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #34677: [SPARK-37436][PYTHON] Uses Python's standard string formatter for SQL API in pandas API on Spark
AmplabJenkins commented on pull request #34677: URL: https://github.com/apache/spark/pull/34677#issuecomment-976494224 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145547/
[GitHub] [spark] thejdeep commented on pull request #34607: [SPARK-36038][CORE] Speculation metrics summary at stage level
thejdeep commented on pull request #34607: URL: https://github.com/apache/spark/pull/34607#issuecomment-976532846

> > I changed maybeUpdate to update when we encounter a speculative task. The problem with the previous approach was that - if we have a speculative task, then future task end events would all make unnecessary speculation summary writes to DB. What do you think ?
>
> How about changing `speculationStageSummary` in `LiveStage` to `var speculationStageSummary: Option[LiveSpeculationStageSummary]` and then, we can check whether there is any update about speculation summary. If it's `Some`, we can update with `maybeUpdate`.

@sarutak I would rather not, since it would require changing a class `val` to a `var`. I am assuming this is the change you suggested:

```
// AppStatusListener.scala
if (event.taskInfo.speculative) {
  stage.speculationStageSummary = Some(stage.speculationStageSummary.getOrElse(
    new LiveSpeculationStageSummary(event.stageId, event.stageAttemptId)))
  val speculationStageSummary = stage.speculationStageSummary.get
  speculationStageSummary.numActiveTasks -= 1
  speculationStageSummary.numCompletedTasks += completedDelta
  speculationStageSummary.numFailedTasks += failedDelta
  speculationStageSummary.numKilledTasks += killedDelta
}
if (stage.speculationStageSummary.isDefined) {
  maybeUpdate(stage.speculationStageSummary.get, now)
}
```

and

```
// LiveEntity.scala
var speculationStageSummary: Option[LiveSpeculationStageSummary] = None
```

What do you think?
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575: URL: https://github.com/apache/spark/pull/34575#discussion_r755180169

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ##

```
@@ -103,6 +116,135 @@ class FileScanRDD(
     context.killTaskIfInterrupted()
     (currentIterator != null && currentIterator.hasNext) || nextIterator()
   }
+
+  ///
+  // FILE METADATA METHODS //
+  ///
+
+  // whether a metadata column exists and it is a `MetadataAttribute`
+  private lazy val hasMetadataAttribute: Boolean = {
+    metadataStruct.exists {
+      case MetadataAttribute(_) => true
+      case _ => false
+    }
+  }
+
+  // metadata struct unsafe row, will only be updated when the current file is changed
+  @volatile private var metadataStructUnsafeRow: UnsafeRow = _
+  // metadata generic row, will only be updated when the current file is changed
+  @volatile private var metadataStructGenericRow: Row = _
+  // an unsafe joiner to join an unsafe row with the metadata unsafe row
+  lazy private val unsafeRowJoiner =
+    if (hasMetadataAttribute)
+      GenerateUnsafeRowJoiner.create(requiredSchema, Seq(metadataStruct.get).toStructType)
+
+  // Create a off/on heap WritableColumnVector
+  private def createColumnVector(numRows: Int, dataType: DataType): WritableColumnVector = {
+    if (offHeapColumnVectorEnabled) {
+      new OffHeapColumnVector(numRows, dataType)
+    } else {
+      new OnHeapColumnVector(numRows, dataType)
+    }
+  }
+
+  /**
+   * For each partitioned file, metadata columns for each record in the file are exactly same.
+   * Only update metadata columns when `currentFile` is changed.
+   */
+  private def updateMetadataStruct(): Unit = {
+    if (hasMetadataAttribute) {
+      val meta = metadataStruct.get
+      if (currentFile == null) {
+        metadataStructUnsafeRow = new UnsafeRow(1)
+        metadataStructGenericRow = new GenericRow(1)
+      } else {
+        // make an generic row
+        assert(meta.dataType.isInstanceOf[StructType])
+        metadataStructGenericRow = Row.fromSeq(
+          meta.dataType.asInstanceOf[StructType].names.map {
+            case FILE_PATH => UTF8String.fromString(new File(currentFile.filePath).toString)
+            case FILE_NAME => UTF8String.fromString(
+              currentFile.filePath.split("/").last)
+            case FILE_SIZE => currentFile.fileSize
+            case FILE_MODIFICATION_TIME => currentFile.modificationTime
+            case _ => None // be exhaustive, won't happen
+          }
+        )
+
+        // convert the generic row to an unsafe row
+        val unsafeRowConverter = {
+          val converter = UnsafeProjection.create(
+            Array(METADATA_STRUCT))
```

Review comment: seems safer to use `Array(meta.dataType.asInstanceOf[StructType])`
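For orientation, a minimal sketch of the generic-row-to-unsafe-row conversion pattern under review, using Spark-internal APIs with an abbreviated, illustrative metadata schema (not the PR's exact code):

```
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.{LongType, StringType, StructType}
import org.apache.spark.unsafe.types.UTF8String

// Build the projection from the actual metadata attribute's StructType
// (as suggested above), then reuse it once per file. The projection
// reuses an internal buffer, so copy() is needed to retain the row.
val metaSchema = new StructType()
  .add("file_path", StringType)
  .add("file_size", LongType)
val toUnsafe = UnsafeProjection.create(metaSchema)
val unsafeMetaRow = toUnsafe(
  InternalRow(UTF8String.fromString("/data/f0.parquet"), 1024L)).copy()
```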
[GitHub] [spark] nchammas commented on pull request #34655: [SPARK-37380][PYTHON] Miscellaneous Python lint infra cleanup
nchammas commented on pull request #34655: URL: https://github.com/apache/spark/pull/34655#issuecomment-976699676 @HyukjinKwon - Is there anyone else you think should review this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #34684: [SPARK-37442][SQL] InMemoryRelation statistics bug causing broadcast join failures with AQE enabled
cloud-fan commented on pull request #34684: URL: https://github.com/apache/spark/pull/34684#issuecomment-976714972 Why is there no problem with AQE off? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on a change in pull request #34692: [MINOR][DOCS] Update scaladoc of KnownSizeEstimation
srowen commented on a change in pull request #34692:
URL: https://github.com/apache/spark/pull/34692#discussion_r755242633

## File path: core/src/main/scala/org/apache/spark/util/SizeEstimator.scala
@@ -33,10 +33,9 @@ import org.apache.spark.util.collection.OpenHashSet
 
 /**
  * A trait that allows a class to give [[SizeEstimator]] more accurate size estimation.
- * When a class extends it, [[SizeEstimator]] will query the `estimatedSize` first.
- * If `estimatedSize` does not return `None`, [[SizeEstimator]] will use the returned size
- * as the size of the object. Otherwise, [[SizeEstimator]] will do the estimation work.
- * The difference between a [[KnownSizeEstimation]] and
+ * When a class extends it, [[SizeEstimator]] will query the `estimatedSize`, and use
+ * the returned size as the size of the object. Otherwise, [[SizeEstimator]] will do

Review comment:
Can you explain this more? I'm not clear what the difference is in these statements

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
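For context, the shape of the trait under discussion, sketched from the scaladoc quoted above: `estimatedSize` returns a plain `Long`, not an `Option`, which is what made the old "does not return `None`" wording confusing.

```
// Sketch of the trait as described by the scaladoc above; estimatedSize
// returns Long, so it can never literally "return None".
trait KnownSizeEstimation {
  def estimatedSize: Long
}
```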
[GitHub] [spark] cloud-fan commented on a change in pull request #34596: [SPARK-37326][SQL] Support TimestampNTZ in CSV data source
cloud-fan commented on a change in pull request #34596:
URL: https://github.com/apache/spark/pull/34596#discussion_r755252311

## File path: sql/core/src/test/resources/sql-tests/results/timestampNTZ/timestamp.sql.out
@@ -373,17 +374,19 @@ struct
[GitHub] [spark] SparkQA removed a comment on pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
SparkQA removed a comment on pull request #34693: URL: https://github.com/apache/spark/pull/34693#issuecomment-976453455 **[Test build #145548 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145548/testReport)** for PR 34693 at commit [`e2c9577`](https://github.com/apache/spark/commit/e2c957718196f8269c56fb4e3c6a8b4d1eed3074). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
SparkQA commented on pull request #34693: URL: https://github.com/apache/spark/pull/34693#issuecomment-976827223 **[Test build #145548 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145548/testReport)** for PR 34693 at commit [`e2c9577`](https://github.com/apache/spark/commit/e2c957718196f8269c56fb4e3c6a8b4d1eed3074). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans
AmplabJenkins removed a comment on pull request #34611: URL: https://github.com/apache/spark/pull/34611#issuecomment-976490264 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145540/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34677: [SPARK-37436][PYTHON] Uses Python's standard string formatter for SQL API in pandas API on Spark
SparkQA commented on pull request #34677: URL: https://github.com/apache/spark/pull/34677#issuecomment-976491507 **[Test build #145547 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145547/testReport)** for PR 34677 at commit [`0b67651`](https://github.com/apache/spark/commit/0b6765150798799a418d39209b4e5f6d4a16276e). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class SQLStringFormatter(string.Formatter):` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] thejdeep commented on a change in pull request #34607: [SPARK-36038][CORE] Speculation metrics summary at stage level
thejdeep commented on a change in pull request #34607:
URL: https://github.com/apache/spark/pull/34607#discussion_r755122578

## File path: core/src/main/scala/org/apache/spark/status/storeTypes.scala
@@ -399,6 +399,20 @@ private[spark] class ExecutorStageSummaryWrapper(
 }
 
+private[spark] class SpeculationStageSummaryWrapper(
+    val stageId: Int,
+    val stageAttemptId: Int,
+    val info: SpeculationStageSummary) {
+
+  @JsonIgnore @KVIndex
+  private val _id: Array[Int] = Array(stageId, stageAttemptId)
+
+  @JsonIgnore @KVIndex("stage")
+  private def stage: Array[Int] = Array(stageId, stageAttemptId)
+
+  private[this] val id: Array[Int] = _id

Review comment:
The KVStore API requires an `id`, but this was implemented just for convenience over class members and API constructs. I can remove `_id`.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
SparkQA commented on pull request #34668: URL: https://github.com/apache/spark/pull/34668#issuecomment-976553555 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50021/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575:
URL: https://github.com/apache/spark/pull/34575#discussion_r755161598

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
@@ -194,10 +195,22 @@ case class FileSourceScanExec(
     disableBucketedScan: Boolean = false)
   extends DataSourceScanExec {
 
+  lazy val outputMetadataStruct: Option[AttributeReference] =

Review comment:
nit: it's probably better to use a noun here, how about `metadataStructCol`?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sarutak commented on pull request #34607: [SPARK-36038][CORE] Speculation metrics summary at stage level
sarutak commented on pull request #34607: URL: https://github.com/apache/spark/pull/34607#issuecomment-976610243 > since it would require changing a class val to a var. What's the problem? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575:
URL: https://github.com/apache/spark/pull/34575#discussion_r755175121

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
@@ -103,6 +116,135 @@ class FileScanRDD(
       context.killTaskIfInterrupted()
       (currentIterator != null && currentIterator.hasNext) || nextIterator()
     }
+
+    ///////////////////////////
+    // FILE METADATA METHODS //
+    ///////////////////////////
+
+    // whether a metadata column exists and it is a `MetadataAttribute`
+    private lazy val hasMetadataAttribute: Boolean = {
+      metadataStruct.exists {
+        case MetadataAttribute(_) => true
+        case _ => false
+      }
+    }
+
+    // metadata struct unsafe row, will only be updated when the current file is changed
+    @volatile private var metadataStructUnsafeRow: UnsafeRow = _
+    // metadata generic row, will only be updated when the current file is changed
+    @volatile private var metadataStructGenericRow: Row = _
+    // an unsafe joiner to join an unsafe row with the metadata unsafe row
+    lazy private val unsafeRowJoiner =
+      if (hasMetadataAttribute)
+        GenerateUnsafeRowJoiner.create(requiredSchema, Seq(metadataStruct.get).toStructType)
+
+    // Create a off/on heap WritableColumnVector
+    private def createColumnVector(numRows: Int, dataType: DataType): WritableColumnVector = {
+      if (offHeapColumnVectorEnabled) {
+        new OffHeapColumnVector(numRows, dataType)
+      } else {
+        new OnHeapColumnVector(numRows, dataType)
+      }
+    }
+
+    /**
+     * For each partitioned file, metadata columns for each record in the file are exactly same.
+     * Only update metadata columns when `currentFile` is changed.
+     */
+    private def updateMetadataStruct(): Unit = {
+      if (hasMetadataAttribute) {
+        val meta = metadataStruct.get
+        if (currentFile == null) {
+          metadataStructUnsafeRow = new UnsafeRow(1)
+          metadataStructGenericRow = new GenericRow(1)
+        } else {
+          // make an generic row
+          assert(meta.dataType.isInstanceOf[StructType])
+          metadataStructGenericRow = Row.fromSeq(
+            meta.dataType.asInstanceOf[StructType].names.map {
+              case FILE_PATH => UTF8String.fromString(new File(currentFile.filePath).toString)

Review comment:
Why can't we use `filePath` directly? `new File` looks buggy as it won't work for `s3://` or other file systems.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
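To make the concern concrete: `java.io.File` normalizes its argument as a local filesystem path and collapses repeated separators, so a URI-style string loses the `//` after the scheme. A small self-contained illustration (the path is hypothetical):

```
import java.io.File

object FilePathNormalization {
  def main(args: Array[String]): Unit = {
    // File's constructor collapses the double slash, mangling the scheme
    // separator of a URI-style path.
    println(new File("s3://bucket/dir/data.parquet").toString)
    // On a Unix-like JVM this prints: s3:/bucket/dir/data.parquet
  }
}
```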
[GitHub] [spark] cloud-fan commented on a change in pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty
cloud-fan commented on a change in pull request #34504:
URL: https://github.com/apache/spark/pull/34504#discussion_r755210674

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
@@ -1548,6 +1548,31 @@ object PushPredicateThroughNonJoin extends Rule[LogicalPlan] with PredicateHelpe
         filter
     }
 
+    case filter @ Filter(condition, w: Window) if w.partitionSpec.isEmpty =>
+      w.windowExpressions match {
+        case Seq(alias @ Alias(WindowExpression(_: RowNumber, WindowSpecDefinition(Nil, orderSpec,
+            SpecifiedWindowFrame(RowFrame, UnboundedPreceding, CurrentRow))), _)) =>

Review comment:
I think this optimization should also apply when there is more than one window expression. The algorithm should simply be:
```
val limits = w.windowExpressions.map {
  case row_number_func => calculate the limit
  case _ => none
}
val limit = limits.min
add limit + sort
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
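A runnable toy version of the min-of-limits step in the sketch above; the `Some(n)`/`None` values stand in for per-window-expression bounds derived from the filter condition (hypothetical helper output, not Spark's actual API):

```
object MinWindowLimit {
  def main(args: Array[String]): Unit = {
    // Hypothetical per-window-expression limits: Some(n) when the filter
    // bounds row_number() by n, None when an expression gives no bound.
    val limits: Seq[Option[Int]] = Seq(Some(5), Some(3))

    // Only rewrite when every window expression is bounded; the smallest
    // bound becomes the pushed-down limit.
    if (limits.nonEmpty && limits.forall(_.isDefined)) {
      val limit = limits.flatten.min
      println(s"rewrite as Limit($limit) + Sort") // prints: rewrite as Limit(3) + Sort
    }
  }
}
```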
[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-976763063 **[Test build #145550 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145550/testReport)** for PR 34671 at commit [`61fe9fb`](https://github.com/apache/spark/commit/61fe9fb6bc09863aa74ab82ac5d653e2b922ae0a). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-976826549 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50022/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #34677: [SPARK-37436][PYTHON] Uses Python's standard string formatter for SQL API in pandas API on Spark
SparkQA removed a comment on pull request #34677: URL: https://github.com/apache/spark/pull/34677#issuecomment-976356000 **[Test build #145547 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145547/testReport)** for PR 34677 at commit [`0b67651`](https://github.com/apache/spark/commit/0b6765150798799a418d39209b4e5f6d4a16276e). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
SparkQA commented on pull request #34668: URL: https://github.com/apache/spark/pull/34668#issuecomment-976492795 **[Test build #145549 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145549/testReport)** for PR 34668 at commit [`24d3046`](https://github.com/apache/spark/commit/24d3046b5c3c58c2e2d2edf7a9f6fa095ba75682). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
AmplabJenkins commented on pull request #34693: URL: https://github.com/apache/spark/pull/34693#issuecomment-976570563 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50020/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34691: [SPARK-37447][SQL] Cache LogicalPlan.isStreaming() result in a lazy val
AmplabJenkins commented on pull request #34691: URL: https://github.com/apache/spark/pull/34691#issuecomment-976570562 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145544/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
SparkQA commented on pull request #34693: URL: https://github.com/apache/spark/pull/34693#issuecomment-976568407 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50020/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
cloud-fan commented on a change in pull request #34668:
URL: https://github.com/apache/spark/pull/34668#discussion_r755144886

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala
@@ -78,9 +79,30 @@ abstract class AbstractSqlParser extends ParserInterface with SQLConfHelper with
     astBuilder.visitQuery(parser.query())
   }
 
+  /** check `has_unclosed_bracketed_comment` to find out the unclosed bracketed comment. */
+  private def singleStatementWithCheck(
+      parser: SqlBaseParser, sqlText: String): SingleStatementContext = {
+    val singleStatementContext = parser.singleStatement()
+    assert(parser.getTokenStream.isInstanceOf[CommonTokenStream])
+
+    val tokenStream = parser.getTokenStream.asInstanceOf[CommonTokenStream]
+    assert(tokenStream.getTokenSource.isInstanceOf[SqlBaseLexer])
+
+    val lexer = tokenStream.getTokenSource.asInstanceOf[SqlBaseLexer]
+    if (lexer.has_unclosed_bracketed_comment) {
+      // The last token is 'EOF' and the penultimate is unclosed bracketed comment
+      val failedToken = tokenStream.get(tokenStream.size() - 2)
+      assert(failedToken.getType() == SqlBaseParser.BRACKETED_COMMENT)
+      val position = Origin(Option(failedToken.getLine), Option(failedToken.getCharPositionInLine))
+      throw QueryParsingErrors.unclosedBracketedCommentError(sqlText, position)
+    }
+
+    singleStatementContext
+  }
+
   /** Creates LogicalPlan for a given SQL string. */
   override def parsePlan(sqlText: String): LogicalPlan = parse(sqlText) { parser =>
-    astBuilder.visitSingleStatement(parser.singleStatement()) match {
+    astBuilder.visitSingleStatement(singleStatementWithCheck(parser, sqlText)) match {

Review comment:
does this work for `parseExpression` and others?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
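For context, the input this check guards against is a bracketed comment that is opened but never closed: the lexer reaches EOF while still inside the comment, so the penultimate token is the unterminated comment itself. A hypothetical example input:

```
// Parsing this should now raise unclosedBracketedCommentError with the
// comment's position instead of a generic parse failure.
val unclosedSql = "SELECT 1 /* this bracketed comment never ends"
```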
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575:
URL: https://github.com/apache/spark/pull/34575#discussion_r755169306

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
@@ -57,11 +66,15 @@ case class PartitionedFile(
 class FileScanRDD(
     @transient private val sparkSession: SparkSession,
     readFunction: (PartitionedFile) => Iterator[InternalRow],
-    @transient val filePartitions: Seq[FilePartition])
+    @transient val filePartitions: Seq[FilePartition],
+    val requiredSchema: StructType = StructType(Seq.empty),

Review comment:
nit: `dataSchema` is probably a better name in this context. We need to join the data rows with the metadata col w.r.t. this schema.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575:
URL: https://github.com/apache/spark/pull/34575#discussion_r755169884

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
@@ -57,11 +66,15 @@ case class PartitionedFile(
 class FileScanRDD(
     @transient private val sparkSession: SparkSession,
     readFunction: (PartitionedFile) => Iterator[InternalRow],
-    @transient val filePartitions: Seq[FilePartition])
+    @transient val filePartitions: Seq[FilePartition],
+    val requiredSchema: StructType = StructType(Seq.empty),
+    val metadataStruct: Option[AttributeReference] = None)

Review comment:
nit: `metadataStructCol`

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575:
URL: https://github.com/apache/spark/pull/34575#discussion_r755176293

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
@@ -103,6 +116,135 @@ class FileScanRDD(
       context.killTaskIfInterrupted()
       (currentIterator != null && currentIterator.hasNext) || nextIterator()
     }
+
+    ///////////////////////////
+    // FILE METADATA METHODS //
+    ///////////////////////////
+
+    // whether a metadata column exists and it is a `MetadataAttribute`
+    private lazy val hasMetadataAttribute: Boolean = {
+      metadataStruct.exists {
+        case MetadataAttribute(_) => true
+        case _ => false
+      }
+    }
+
+    // metadata struct unsafe row, will only be updated when the current file is changed
+    @volatile private var metadataStructUnsafeRow: UnsafeRow = _
+    // metadata generic row, will only be updated when the current file is changed
+    @volatile private var metadataStructGenericRow: Row = _
+    // an unsafe joiner to join an unsafe row with the metadata unsafe row
+    lazy private val unsafeRowJoiner =
+      if (hasMetadataAttribute)
+        GenerateUnsafeRowJoiner.create(requiredSchema, Seq(metadataStruct.get).toStructType)
+
+    // Create a off/on heap WritableColumnVector
+    private def createColumnVector(numRows: Int, dataType: DataType): WritableColumnVector = {
+      if (offHeapColumnVectorEnabled) {
+        new OffHeapColumnVector(numRows, dataType)
+      } else {
+        new OnHeapColumnVector(numRows, dataType)
+      }
+    }
+
+    /**
+     * For each partitioned file, metadata columns for each record in the file are exactly same.
+     * Only update metadata columns when `currentFile` is changed.
+     */
+    private def updateMetadataStruct(): Unit = {
+      if (hasMetadataAttribute) {
+        val meta = metadataStruct.get
+        if (currentFile == null) {
+          metadataStructUnsafeRow = new UnsafeRow(1)
+          metadataStructGenericRow = new GenericRow(1)
+        } else {
+          // make an generic row
+          assert(meta.dataType.isInstanceOf[StructType])
+          metadataStructGenericRow = Row.fromSeq(
+            meta.dataType.asInstanceOf[StructType].names.map {
+              case FILE_PATH => UTF8String.fromString(new File(currentFile.filePath).toString)
+              case FILE_NAME => UTF8String.fromString(
+                currentFile.filePath.split("/").last)

Review comment:
path separator is not always `/`. Let's use `java.io.File.separator`

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575:
URL: https://github.com/apache/spark/pull/34575#discussion_r755181619

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
@@ -103,6 +116,135 @@ class FileScanRDD(
       context.killTaskIfInterrupted()
       (currentIterator != null && currentIterator.hasNext) || nextIterator()
     }
+
+    ///////////////////////////
+    // FILE METADATA METHODS //
+    ///////////////////////////
+
+    // whether a metadata column exists and it is a `MetadataAttribute`
+    private lazy val hasMetadataAttribute: Boolean = {
+      metadataStruct.exists {
+        case MetadataAttribute(_) => true
+        case _ => false
+      }
+    }
+
+    // metadata struct unsafe row, will only be updated when the current file is changed
+    @volatile private var metadataStructUnsafeRow: UnsafeRow = _
+    // metadata generic row, will only be updated when the current file is changed
+    @volatile private var metadataStructGenericRow: Row = _
+    // an unsafe joiner to join an unsafe row with the metadata unsafe row
+    lazy private val unsafeRowJoiner =
+      if (hasMetadataAttribute)
+        GenerateUnsafeRowJoiner.create(requiredSchema, Seq(metadataStruct.get).toStructType)
+
+    // Create a off/on heap WritableColumnVector
+    private def createColumnVector(numRows: Int, dataType: DataType): WritableColumnVector = {
+      if (offHeapColumnVectorEnabled) {
+        new OffHeapColumnVector(numRows, dataType)
+      } else {
+        new OnHeapColumnVector(numRows, dataType)
+      }
+    }
+
+    /**
+     * For each partitioned file, metadata columns for each record in the file are exactly same.
+     * Only update metadata columns when `currentFile` is changed.
+     */
+    private def updateMetadataStruct(): Unit = {
+      if (hasMetadataAttribute) {
+        val meta = metadataStruct.get
+        if (currentFile == null) {
+          metadataStructUnsafeRow = new UnsafeRow(1)
+          metadataStructGenericRow = new GenericRow(1)
+        } else {
+          // make an generic row
+          assert(meta.dataType.isInstanceOf[StructType])
+          metadataStructGenericRow = Row.fromSeq(

Review comment:
Why do we create Row here and then use `CatalystTypeConverters` later? Let's just use `InternalRow.fromSeq` here.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
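A minimal sketch of the suggested direction: build the metadata row in its internal representation with `InternalRow.fromSeq`, skipping the external `Row` plus `CatalystTypeConverters` round trip. The field values are hypothetical:

```
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical metadata values in internal form: strings as UTF8String,
// size and modification time as Long.
val metadataInternalRow: InternalRow = InternalRow.fromSeq(Seq(
  UTF8String.fromString("s3://bucket/dir/data.parquet"), // FILE_PATH
  UTF8String.fromString("data.parquet"),                 // FILE_NAME
  1024L,                                                 // FILE_SIZE
  1637700000000L                                         // FILE_MODIFICATION_TIME
))
```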
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575:
URL: https://github.com/apache/spark/pull/34575#discussion_r755185991

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
@@ -212,7 +212,9 @@ object FileSourceStrategy extends Strategy with PredicateHelper with Logging {
       val outputSchema = readDataColumns.toStructType
       logInfo(s"Output Data Schema: ${outputSchema.simpleString(5)}")
 
-      val outputAttributes = readDataColumns ++ partitionColumns
+      // outputAttributes should also include referenced metadata columns at the every end
+      val outputAttributes = readDataColumns ++ partitionColumns ++
+        plan.references.collect { case MetadataAttribute(attr) => attr }

Review comment:
shall we use `requiredAttributes` instead of `plan.references`?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
SparkQA commented on pull request #34668: URL: https://github.com/apache/spark/pull/34668#issuecomment-976633320 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50021/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
AmplabJenkins commented on pull request #34668: URL: https://github.com/apache/spark/pull/34668#issuecomment-97983 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50021/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on a change in pull request #34689: [SPARK-37445][BUILD] Upgrade hadoop profile to hadoop-3.3 since we support hadoop-3.3 as default now
srowen commented on a change in pull request #34689:
URL: https://github.com/apache/spark/pull/34689#discussion_r755239983

## File path: hadoop-cloud/pom.xml
@@ -201,7 +201,7 @@
       enables store-specific committers.
     -->
-      <id>hadoop-3.2</id>
+      <id>hadoop-3.3</id>

Review comment:
Same comment as hive-2.3, but here it may be better to change the name to make sure callers are clear it's not Hadoop 3.2. How about avoiding this going forward with `hadoop-3`? I doubt we're going to publish builds separately for two different Hadoop 3 versions in the future.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34634: [SPARK-37357][SQL] Create skew partition specs should respect min partition size
cloud-fan commented on a change in pull request #34634:
URL: https://github.com/apache/spark/pull/34634#discussion_r755239272

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsUtil.scala
@@ -316,21 +316,25 @@ object ShufflePartitionsUtil extends Logging {
    * start of a partition.
    */
   // Visible for testing
-  private[sql] def splitSizeListByTargetSize(sizes: Seq[Long], targetSize: Long): Array[Int] = {
+  private[sql] def splitSizeListByTargetSize(
+      sizes: Seq[Long],
+      targetSize: Long,
+      minPartitionSize: Long): Array[Int] = {
     val partitionStartIndices = ArrayBuffer[Int]()
     partitionStartIndices += 0
     var i = 0
     var currentPartitionSize = 0L
     var lastPartitionSize = -1L
+    val minSize =
+      Math.min(targetSize, Math.max(minPartitionSize, targetSize * SMALL_PARTITION_FACTOR))
     def tryMergePartitions() = {
       // When we are going to start a new partition, it's possible that the current partition or
       // the previous partition is very small and it's better to merge the current partition into
       // the previous partition.
       val shouldMergePartitions = lastPartitionSize > -1 &&
         ((currentPartitionSize + lastPartitionSize) < targetSize * MERGED_PARTITION_FACTOR ||

Review comment:
I see a conflict here:
1. we want to make each partition smaller than `targetSize * MERGED_PARTITION_FACTOR`
2. we want to make each partition larger than `targetSize * SMALL_PARTITION_FACTOR`

Seems like we prioritize 2 over 1, but in some cases `SMALL_PARTITION_FACTOR` needs to be smaller. How about we make `SMALL_PARTITION_FACTOR` a parameter of the function? Then we can pass different values from the skew join rule and the rebalance rule, e.g. the value can be `0.1` in the rebalance rule, or we could even make it configurable.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
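A hedged sketch of the parameterization suggested above: `smallPartitionFactor` becomes a caller-supplied argument so the skew join rule and the rebalance rule can pass different values (for example `0.1` from the rebalance rule). Only the signature and the `minSize` computation are shown; the merging logic is elided:

```
// Sketch only; the body elides the unchanged merging logic from the diff above.
private[sql] def splitSizeListByTargetSize(
    sizes: Seq[Long],
    targetSize: Long,
    minPartitionSize: Long,
    smallPartitionFactor: Double): Array[Int] = {
  val minSize =
    Math.min(targetSize, Math.max(minPartitionSize, targetSize * smallPartitionFactor))
  ??? // merging logic unchanged
}
```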
[GitHub] [spark] peter-toth edited a comment on pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
peter-toth edited a comment on pull request #34693:
URL: https://github.com/apache/spark/pull/34693#issuecomment-97672

This change also seems to work with MSSQL's temp table syntax:
```
val withClause = "(SELECT * INTO #TempTable FROM (SELECT * FROM tbl WHERE x > 10) t)"
val query = "SELECT * FROM #TempTable"

val df = spark.read.format("jdbc")
  .option("url", jdbcUrl)
  .option("withClause", withClause)
  .option("query", query)
  .load()
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] tgravescs commented on pull request #34622: [SPARK-37340][UI] Display StageIds in Operators for SQL UI
tgravescs commented on pull request #34622:
URL: https://github.com/apache/spark/pull/34622#issuecomment-976749417

Yes, it would be nice to have the actual stageIds in the UI. I'll need to look closer at the logic though, which likely won't be till next week.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA removed a comment on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-976763063 **[Test build #145550 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145550/testReport)** for PR 34671 at commit [`61fe9fb`](https://github.com/apache/spark/commit/61fe9fb6bc09863aa74ab82ac5d653e2b922ae0a). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] tdg5 commented on a change in pull request #29024: [SPARK-32001][SQL]Create JDBC authentication provider developer API
tdg5 commented on a change in pull request #29024:
URL: https://github.com/apache/spark/pull/29024#discussion_r755337180

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/connection/ConnectionProvider.scala
@@ -18,60 +18,45 @@ package org.apache.spark.sql.execution.datasources.jdbc.connection
 
 import java.sql.{Connection, Driver}
-import java.util.Properties
+import java.util.ServiceLoader
 
-import org.apache.spark.internal.Logging
-import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions
-
-/**
- * Connection provider which opens connection toward various databases (database specific instance
- * needed). If kerberos authentication required then it's the provider's responsibility to set all
- * the parameters.
- */
-private[jdbc] trait ConnectionProvider {
-  /**
-   * Additional properties for data connection (Data source property takes precedence).
-   */
-  def getAdditionalProperties(): Properties = new Properties()
+import scala.collection.mutable
 
-  /**
-   * Opens connection toward the database.
-   */
-  def getConnection(): Connection
-}
+import org.apache.spark.internal.Logging
+import org.apache.spark.security.SecurityConfigurationLock
+import org.apache.spark.sql.jdbc.JdbcConnectionProvider
+import org.apache.spark.util.Utils
 
 private[jdbc] object ConnectionProvider extends Logging {
-  def create(driver: Driver, options: JDBCOptions): ConnectionProvider = {
-    if (options.keytab == null || options.principal == null) {
-      logDebug("No authentication configuration found, using basic connection provider")
-      new BasicConnectionProvider(driver, options)
-    } else {
-      logDebug("Authentication configuration found, using database specific connection provider")
-      options.driverClass match {
-        case PostgresConnectionProvider.driverClass =>
-          logDebug("Postgres connection provider found")
-          new PostgresConnectionProvider(driver, options)
-
-        case MariaDBConnectionProvider.driverClass =>
-          logDebug("MariaDB connection provider found")
-          new MariaDBConnectionProvider(driver, options)
-
-        case DB2ConnectionProvider.driverClass =>
-          logDebug("DB2 connection provider found")
-          new DB2ConnectionProvider(driver, options)
-
-        case MSSQLConnectionProvider.driverClass =>
-          logDebug("MS SQL connection provider found")
-          new MSSQLConnectionProvider(driver, options)
-
-        case OracleConnectionProvider.driverClass =>
-          logDebug("Oracle connection provider found")
-          new OracleConnectionProvider(driver, options)
-
-        case _ =>
-          throw new IllegalArgumentException(s"Driver ${options.driverClass} does not support " +
-            "Kerberos authentication")
+  private val providers = loadProviders()
+
+  def loadProviders(): Seq[JdbcConnectionProvider] = {
+    val loader = ServiceLoader.load(classOf[JdbcConnectionProvider],
+      Utils.getContextOrSparkClassLoader)
+    val providers = mutable.ArrayBuffer[JdbcConnectionProvider]()
+
+    val iterator = loader.iterator
+    while (iterator.hasNext) {
+      try {
+        val provider = iterator.next
+        logDebug(s"Loaded built in provider: $provider")
+        providers += provider
+      } catch {
+        case t: Throwable =>
+          logError(s"Failed to load built in provider.", t)
       }
     }
+    // Seems duplicate but it's needed for Scala 2.13
+    providers.toSeq
+  }
+
+  def create(driver: Driver, options: Map[String, String]): Connection = {
+    val filteredProviders = providers.filter(_.canHandle(driver, options))
+    require(filteredProviders.size == 1,
+      "JDBC connection initiated but not exactly one connection provider found which can handle " +
+        s"it. Found active providers: ${filteredProviders.mkString(", ")}")
+    SecurityConfigurationLock.synchronized {

Review comment:
I don't know anything about the underlying problem, but it had crossed my mind that allowing the synchronization to be optional could be one path forward. @gaborgsomogyi may I trouble you for some references related to

> The main problem is that JVM has a single security context. In order to make a connection one MUST modify the security context, otherwise the JDBC connector is not able to provide the security credentials (TGT, keytab or whatever)

so I can better familiarize myself with the problem that you are describing? Beyond this particular issue, what you've shared suggests that the concurrency utilized by my app could be causing us to crosswire data, which would be a major problem. I guess I'd also ask: is there more to it than you described? It sounds like I should either have some cross-wired data, or if that's not the case then there is some missing piece of the puzzle that means the synchronization is not always required. Thanks in advance!
--
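For readers following the ServiceLoader discussion in this thread: providers are discovered purely through classpath registration, so a third party can plug in its own. A hedged sketch (the class name and option key are hypothetical; the two abstract methods follow the diff above), which would also need a `META-INF/services/org.apache.spark.sql.jdbc.JdbcConnectionProvider` file naming the class:

```
import java.sql.{Connection, Driver}
import java.util.Properties

import org.apache.spark.sql.jdbc.JdbcConnectionProvider

// Hypothetical provider: canHandle decides whether this provider owns the
// connection; ConnectionProvider.create requires exactly one provider to
// return true for a given driver/options pair.
class MyKerberosConnectionProvider extends JdbcConnectionProvider {
  override def canHandle(driver: Driver, options: Map[String, String]): Boolean =
    options.get("connectionProvider").contains("my-kerberos")

  override def getConnection(driver: Driver, options: Map[String, String]): Connection = {
    val props = new Properties()
    options.foreach { case (k, v) => props.setProperty(k, v) }
    // Any security-context changes would happen here, guarded by the
    // SecurityConfigurationLock synchronization discussed above.
    driver.connect(options("url"), props)
  }
}
```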
[GitHub] [spark] ChenMichael edited a comment on pull request #34684: [SPARK-37442][SQL] InMemoryRelation statistics bug causing broadcast join failures with AQE enabled
ChenMichael edited a comment on pull request #34684:
URL: https://github.com/apache/spark/pull/34684#issuecomment-976849576

For this problem to manifest, join planning has to happen between the time an InMemoryRelation is converted to an RDD and the time the job executing that RDD completes. In AQE, since it repeatedly replans, this can happen when the InMemoryRelations are on different levels. With AQE off, join planning happens only once, so there is no scenario where one part of the query materializes the InMemoryRelation and join planning then happens on another part of the query with inaccurate stats. I guess this could happen with AQE off if there are concurrent jobs sharing the same InMemoryRelation, though.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
SparkQA commented on pull request #34668: URL: https://github.com/apache/spark/pull/34668#issuecomment-976912054 **[Test build #145549 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145549/testReport)** for PR 34668 at commit [`24d3046`](https://github.com/apache/spark/commit/24d3046b5c3c58c2e2d2edf7a9f6fa095ba75682). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
AmplabJenkins commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-976940851 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50022/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on a change in pull request #34596: [SPARK-37326][SQL] Support TimestampNTZ in CSV data source
sadikovi commented on a change in pull request #34596:
URL: https://github.com/apache/spark/pull/34596#discussion_r755539240

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
@@ -442,17 +442,22 @@ object DateTimeUtils {
 
   /**
    * Trims and parses a given UTF8 string to a corresponding [[Long]] value which representing the
-   * number of microseconds since the epoch. The result is independent of time zones,
-   * which means that zone ID in the input string will be ignored.
+   * number of microseconds since the epoch. The result will be independent of time zones.
+   *
+   * If the input string contains a component associated with time zone, the method will return
+   * `None` if `ignoreTimeZone` is set to `false`. If `ignoreTimeZone` is set to `true`, the method
+   * will simply discard the time zone component. Enable the check to detect situations like parsing
+   * a timestamp with time zone as TimestampNTZType.
+   *
    * The return type is [[Option]] in order to distinguish between 0L and null. Please
    * refer to `parseTimestampString` for the allowed formats.
    */
-  def stringToTimestampWithoutTimeZone(s: UTF8String): Option[Long] = {
+  def stringToTimestampWithoutTimeZone(s: UTF8String, ignoreTimeZone: Boolean): Option[Long] = {

Review comment:
I understand, but we don't fail if `ignoreTimeZone` is set to `false`; we return an `Option` instead. I can update it in `TimestampFormatter`, but `failOnError` in `stringToTimestampWithoutTimeZone` might be a bit misleading. Let me know what you think. Thanks.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34593: [SPARK-37324][SQL] Adds support for decimal rounding mode up, down, half_down
SparkQA commented on pull request #34593: URL: https://github.com/apache/spark/pull/34593#issuecomment-977249138 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50023/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34593: [SPARK-37324][SQL] Adds support for decimal rounding mode up, down, half_down
SparkQA commented on pull request #34593: URL: https://github.com/apache/spark/pull/34593#issuecomment-977252277 **[Test build #145552 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145552/testReport)** for PR 34593 at commit [`ed83e8b`](https://github.com/apache/spark/commit/ed83e8b10196c97b1f04d614af9233fd8579b148). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ChenMichael edited a comment on pull request #34684: [SPARK-37442][SQL] InMemoryRelation statistics bug causing broadcast join failures with AQE enabled
ChenMichael edited a comment on pull request #34684:
URL: https://github.com/apache/spark/pull/34684#issuecomment-976849576

For this problem to manifest, join planning has to happen between the time an InMemoryRelation is converted to an RDD and the time the job executing that RDD completes. In AQE, since it repeatedly replans, this can happen when the InMemoryRelations are on different levels. I guess this could happen with AQE off if there are concurrent jobs sharing the same InMemoryRelation as well.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ChenMichael edited a comment on pull request #34684: [SPARK-37442][SQL] InMemoryRelation statistics bug causing broadcast join failures with AQE enabled
ChenMichael edited a comment on pull request #34684:
URL: https://github.com/apache/spark/pull/34684#issuecomment-976849576

For this problem to manifest, join planning has to happen between the time an InMemoryRelation is converted to an RDD and the time the job executing that RDD completes. In AQE, since it repeatedly replans, this can happen when the InMemoryRelations are on different levels. I guess this could happen with AQE off if there are concurrent jobs sharing the same InMemoryRelation as well.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-976884853 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50022/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on a change in pull request #34596: [SPARK-37326][SQL] Support TimestampNTZ in CSV data source
sadikovi commented on a change in pull request #34596:
URL: https://github.com/apache/spark/pull/34596#discussion_r755538280

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
@@ -164,6 +164,10 @@ class CSVOptions(
       s"${DateFormatter.defaultPattern}'T'HH:mm:ss[.SSS][XXX]"
     })
 
+  val timestampNTZFormatInRead: Option[String] = parameters.get("timestampNTZFormat")

Review comment:
How do you mean? Actually, it is the same config, but reads require an `Option[String]` and writes require a `String` (similar to `timestampFormatInRead` and `timestampFormatInWrite`). I can update it to keep only one property, `timestampNTZFormat`, instead. Would that work?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
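A hedged sketch of the single-option, two-accessor pattern being discussed, mirroring how `timestampFormatInRead`/`timestampFormatInWrite` work; the write-side default shown is an assumption for illustration:

```
// Reads keep Option[String] so the parser can fall back to default patterns;
// writes need a concrete pattern string. The write default below is illustrative.
val timestampNTZFormatInRead: Option[String] = parameters.get("timestampNTZFormat")
val timestampNTZFormatInWrite: String = parameters.getOrElse("timestampNTZFormat",
  s"${DateFormatter.defaultPattern}'T'HH:mm:ss[.SSS]")
```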
[GitHub] [spark] sadikovi commented on a change in pull request #34596: [SPARK-37326][SQL] Support TimestampNTZ in CSV data source
sadikovi commented on a change in pull request #34596: URL: https://github.com/apache/spark/pull/34596#discussion_r755538940 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala ## @@ -66,10 +68,23 @@ sealed trait TimestampFormatter extends Serializable { @throws(classOf[DateTimeParseException]) @throws(classOf[DateTimeException]) @throws(classOf[IllegalStateException]) - def parseWithoutTimeZone(s: String): Long = + def parseWithoutTimeZone(s: String, ignoreTimeZone: Boolean): Long = Review comment: Yes, I will update.
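The reviewer's suggestion isn't visible in this thread, but one plausible shape for the update visible in the diff is to keep the original one-argument signature as a forwarding overload. A hypothetical sketch only; whether `false` is the right default depends on the call sites, which aren't shown here:

```scala
// Hypothetical: preserve the old one-argument entry point by delegating
// to the new overload with a default for the added flag.
trait TimestampFormatterLike {
  def parseWithoutTimeZone(s: String, ignoreTimeZone: Boolean): Long

  // Old single-argument signature kept for source compatibility.
  def parseWithoutTimeZone(s: String): Long =
    parseWithoutTimeZone(s, ignoreTimeZone = false)
}
```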
[GitHub] [spark] SparkQA commented on pull request #34593: [SPARK-37324][SQL] Adds support for decimal rounding mode up, down, half_down
SparkQA commented on pull request #34593: URL: https://github.com/apache/spark/pull/34593#issuecomment-977218053 **[Test build #145551 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145551/testReport)** for PR 34593 at commit [`b7c2cc6`](https://github.com/apache/spark/commit/b7c2cc608f96f362147f6518e7bc596794e2948b).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
AmplabJenkins removed a comment on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-976848362 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145550/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
AmplabJenkins removed a comment on pull request #34693: URL: https://github.com/apache/spark/pull/34693#issuecomment-976848358 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145548/
[GitHub] [spark] viirya commented on pull request #34642: [SPARK-37369][SQL] Avoid redundant ColumnarToRow transistion on InMemoryTableScan
viirya commented on pull request #34642: URL: https://github.com/apache/spark/pull/34642#issuecomment-977016485 > I'm trying to understand the motivation. Is it because in-memory table can output rows efficiently? Parquet scan can also output rows but we try our best to output columnar batches. For a Parquet scan, when we ask it to output columnar batches, it actually behaves quite differently from the row-based approach because it runs the vectorized Parquet reader. I think this is why we try our best to produce columnar batches for Parquet or ORC scans: the vectorized reader usually performs much better, which can offset the cost of any later columnar-to-row transition. An in-memory table, by contrast, does not do a physical disk scan; the data is already serialized in memory. The motivation is that during local experiments I found the columnar-to-row transition costly there, and the columnar output looked meaningless.
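A quick way to observe the transition being discussed is to look for a `ColumnarToRow` node above `InMemoryTableScan` in the physical plan. Plan shapes vary across Spark versions, so this spark-shell snippet only illustrates how to inspect; it is not the PR's benchmark:

```scala
// Cache a table, materialize it, then inspect the plan of a query over it.
val df = spark.range(1000000).toDF("id")
df.cache().count()                 // materialize the in-memory relation
df.selectExpr("id + 1").explain()  // look for InMemoryTableScan / ColumnarToRow
```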
[GitHub] [spark] SparkQA commented on pull request #34386: [WIP] - Changes to PySpark doc homepage and User Guide
SparkQA commented on pull request #34386: URL: https://github.com/apache/spark/pull/34386#issuecomment-977462625 **[Test build #145553 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145553/testReport)** for PR 34386 at commit [`4a22e48`](https://github.com/apache/spark/commit/4a22e485119c01af67d26ebd85ff39451d99bf54). * This patch **fails PySpark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #34386: [WIP] - Changes to PySpark doc homepage and User Guide
AmplabJenkins commented on pull request #34386: URL: https://github.com/apache/spark/pull/34386#issuecomment-977463411 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145553/
[GitHub] [spark] srowen commented on a change in pull request #34689: [SPARK-37445][BUILD] Upgrade hadoop profile to hadoop-3.3 since we support hadoop-3.3 as default now
srowen commented on a change in pull request #34689: URL: https://github.com/apache/spark/pull/34689#discussion_r755671589 ## File path: hadoop-cloud/pom.xml ## @@ -201,7 +201,7 @@ enables store-specific committers. --> - hadoop-3.2 + hadoop-3.3 Review comment: I see, are we going to need a separate Hadoop 3.2 build at all going forward? If not, I think it makes sense to call it hadoop-3.
[GitHub] [spark] AmplabJenkins commented on pull request #34593: [SPARK-37324][SQL] Adds support for decimal rounding mode up, down, half_down
AmplabJenkins commented on pull request #34593: URL: https://github.com/apache/spark/pull/34593#issuecomment-977464156 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145551/
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34679: [SPARK-37437][BUILD] Remove unused hive profile
AngersZh commented on a change in pull request #34679: URL: https://github.com/apache/spark/pull/34679#discussion_r755672634 ## File path: pom.xml ## @@ -3353,11 +3353,6 @@ - Review comment: > We could possibly leave the profile in and have it still do nothing, as if the profile doesn't exist, a build with -Phive-2.3 fails, even though it's really "OK" - it's already hive 2.3 by default. I don't feel strongly about it. Do you mean `test-hive2.3`? I think this won't work now, and we don't need it either.
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34679: [SPARK-37437][BUILD] Remove unused hive profile
AngersZh commented on a change in pull request #34679: URL: https://github.com/apache/spark/pull/34679#discussion_r755674382 ## File path: pom.xml ## @@ -3353,11 +3353,6 @@ - Review comment: > No, I just mean do not remove the profile, so that specifying it does not fail the build But a build with `-Phive-2.3` won't fail either...