[GitHub] [spark] HyukjinKwon closed pull request #34687: [SPARK-36231][PYTHON] Support arithmetic operations of decimal(nan) series
HyukjinKwon closed pull request #34687: URL: https://github.com/apache/spark/pull/34687
[GitHub] [spark] HyukjinKwon commented on pull request #34687: [SPARK-36231][PYTHON] Support arithmetic operations of decimal(nan) series
HyukjinKwon commented on pull request #34687: URL: https://github.com/apache/spark/pull/34687#issuecomment-976348748 Merged to master.
[GitHub] [spark] SparkQA commented on pull request #34677: [SPARK-37436][PYTHON] Uses Python's standard string formatter for SQL API in pandas API on Spark
SparkQA commented on pull request #34677: URL: https://github.com/apache/spark/pull/34677#issuecomment-976432192 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50019/
[GitHub] [spark] peter-toth opened a new pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
peter-toth opened a new pull request #34693: URL: https://github.com/apache/spark/pull/34693

### What changes were proposed in this pull request?
CTE queries are not supported with MSSQL server via JDBC because MSSQL server doesn't support statements with nested CTEs. When Spark builds the final query that it will send to the server via JDBC, it wraps the original query (`options.tableOrQuery`) in parentheses in `JDBCRDD.resolveTable()` and `JDBCRDD.compute()`. Unfortunately, it is non-trivial to split an arbitrary query into "with" and "regular" query clauses in `MsSqlServerDialect`. So instead, I'm proposing a new general JDBC option, "withClause", that users can use if they have complex queries with CTEs.

### Why are the changes needed?
To support CTE queries with MSSQL.

### Does this PR introduce _any_ user-facing change?
Yes, CTE queries are now supported.

### How was this patch tested?
Added new integration UTs.
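For illustration, a minimal sketch of the proposed option in use, assuming the option keeps the name `withClause` described above; `jdbcUrl`, table, and column names are placeholders:

```
// Sketch only: the CTE definition is passed separately via "withClause"
// so that only the regular query part gets wrapped in parentheses when
// Spark sends the final statement to the MSSQL server.
val df = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("withClause", "WITH cte AS (SELECT id, name FROM tbl)")
  .option("query", "SELECT id FROM cte WHERE id > 10")
  .load()
```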
[GitHub] [spark] SparkQA commented on pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
SparkQA commented on pull request #34693: URL: https://github.com/apache/spark/pull/34693#issuecomment-976489807 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50020/
[GitHub] [spark] AmplabJenkins commented on pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans
AmplabJenkins commented on pull request #34611: URL: https://github.com/apache/spark/pull/34611#issuecomment-976490264 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145540/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34677: [SPARK-37436][PYTHON] Uses Python's standard string formatter for SQL API in pandas API on Spark
AmplabJenkins removed a comment on pull request #34677: URL: https://github.com/apache/spark/pull/34677#issuecomment-976494224 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145547/
[GitHub] [spark] SparkQA removed a comment on pull request #34691: [SPARK-37447][SQL] Cache LogicalPlan.isStreaming() result in a lazy val
SparkQA removed a comment on pull request #34691: URL: https://github.com/apache/spark/pull/34691#issuecomment-976296694 **[Test build #145544 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145544/testReport)** for PR 34691 at commit [`0c393a5`](https://github.com/apache/spark/commit/0c393a586b2a172c0138eeacf7552d8561157af4).
[GitHub] [spark] cloud-fan commented on a change in pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
cloud-fan commented on a change in pull request #34668: URL: https://github.com/apache/spark/pull/34668#discussion_r755141437

## File path: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 ##

```
@@ -73,6 +78,19 @@ grammar SqlBase;
       return false;
     }
   }
+
+  /**
+   * This method will be called when the character stream ends and try to find out the
+   * unclosed bracketed comment.
+   * If the next character is -1, it means the end of the entire character stream match,
+   * and we throw exception to prevent executing the query.
+   */
+  public void end() {
```

Review comment: nit: `checkUnclosedComment`
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575: URL: https://github.com/apache/spark/pull/34575#discussion_r755165986

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ##

```
@@ -355,7 +377,14 @@ case class FileSourceScanExec(
   @transient private lazy val pushedDownFilters = {
     val supportNestedPredicatePushdown = DataSourceUtils.supportNestedPredicatePushdown(relation)
-    dataFilters.flatMap(DataSourceStrategy.translateFilter(_, supportNestedPredicatePushdown))
+    dataFilters
+      .filterNot(
+        _.references.exists {
+          case MetadataAttribute(_) => true
```

Review comment: can we leave a TODO comment here?
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575: URL: https://github.com/apache/spark/pull/34575#discussion_r755170678

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ##

```
@@ -103,6 +116,135 @@ class FileScanRDD(
     context.killTaskIfInterrupted()
     (currentIterator != null && currentIterator.hasNext) || nextIterator()
   }
+
+  ///
+  // FILE METADATA METHODS //
+  ///
+
+  // whether a metadata column exists and it is a `MetadataAttribute`
+  private lazy val hasMetadataAttribute: Boolean = {
```

Review comment: do we need this `lazy val`? I think this is simply `metadataStructCol.isDefined`, as the caller side should always pass in the metadata col.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
AmplabJenkins removed a comment on pull request #34668: URL: https://github.com/apache/spark/pull/34668#issuecomment-97983 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50021/
[GitHub] [spark] cloud-fan commented on a change in pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty
cloud-fan commented on a change in pull request #34504: URL: https://github.com/apache/spark/pull/34504#discussion_r755214307

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ##

```
@@ -1548,6 +1548,31 @@ object PushPredicateThroughNonJoin extends Rule[LogicalPlan] with PredicateHelper
       filter
     }
+    case filter @ Filter(condition, w: Window) if w.partitionSpec.isEmpty =>
+      w.windowExpressions match {
+        case Seq(alias @ Alias(WindowExpression(_: RowNumber, WindowSpecDefinition(Nil, orderSpec,
+            SpecifiedWindowFrame(RowFrame, UnboundedPreceding, CurrentRow))), _)) =>
+          val aliasAttr = alias.toAttribute
+          val limitValue = splitConjunctivePredicates(condition).collectFirst {
+            case LessThanOrEqual(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v
+            case Equality(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v
+            case LessThan(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v - 1
+          }
+
+          limitValue match {
+            case Some(lv) if lv <= 0 =>
+              LocalRelation(filter.output, data = Seq.empty, isStreaming = filter.isStreaming)
+            case Some(lv)
+                if lv < conf.topKSortFallbackThreshold && w.child.maxRows.forall(_ > lv) =>
+              filter.copy(child =
+                w.copy(child = Limit(Literal(lv), Sort(orderSpec, true, w.child
```

Review comment: BTW, is `orderSpec` always specified?
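For orientation, a sketch of the query shape this rule targets; names are illustrative and `df` is an assumed input DataFrame with a `ts` column:

```
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Top-k pattern matched by the rule: row_number() over a window with no
// PARTITION BY (so partitionSpec.isEmpty), filtered by a small constant.
// The rewrite effectively places Limit(10) + Sort below the Window, which
// the planner can run as TakeOrderedAndProject instead of a global sort.
val w = Window.orderBy(col("ts"))   // no partitionBy, so partitionSpec is empty
val topK = df.withColumn("rn", row_number().over(w)).filter(col("rn") <= 10)
```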
[GitHub] [spark] cloud-fan closed pull request #34454: [SPARK-37013][CORE][SQL][FOLLOWUP] Use the new error framework to throw error in `FormatString`
cloud-fan closed pull request #34454: URL: https://github.com/apache/spark/pull/34454
[GitHub] [spark] cloud-fan commented on pull request #34454: [SPARK-37013][CORE][SQL][FOLLOWUP] Use the new error framework to throw error in `FormatString`
cloud-fan commented on pull request #34454: URL: https://github.com/apache/spark/pull/34454#issuecomment-976686019 Sorry I missed the ping. Merging to master, thanks!
[GitHub] [spark] srowen commented on a change in pull request #34679: [SPARK-37437][BUILD] Remove unused hive profile
srowen commented on a change in pull request #34679: URL: https://github.com/apache/spark/pull/34679#discussion_r755238638

## File path: pom.xml ##

```
@@ -3353,11 +3353,6 @@
-
```

Review comment: We could possibly leave the profile in and have it still do nothing, because if the profile doesn't exist, a build with -Phive-2.3 fails, even though it's really "OK" - it's already Hive 2.3 by default. I don't feel strongly about it.
[GitHub] [spark] cloud-fan commented on a change in pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
cloud-fan commented on a change in pull request #34668: URL: https://github.com/apache/spark/pull/34668#discussion_r755142758

## File path: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 ##

```
@@ -73,6 +78,19 @@ grammar SqlBase;
       return false;
     }
   }
+
+  /**
+   * This method will be called when the character stream ends and try to find out the
+   * unclosed bracketed comment.
+   * If the next character is -1, it means the end of the entire character stream match,
```

Review comment: BTW, do we need to check? Looking at `('*/' | {end();} EOF)`, it seems it's always EOF if we enter this method.
[GitHub] [spark] cloud-fan commented on a change in pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
cloud-fan commented on a change in pull request #34668: URL: https://github.com/apache/spark/pull/34668#discussion_r755141872

## File path: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 ##

```
@@ -73,6 +78,19 @@ grammar SqlBase;
       return false;
     }
   }
+
+  /**
+   * This method will be called when the character stream ends and try to find out the
+   * unclosed bracketed comment.
+   * If the next character is -1, it means the end of the entire character stream match,
+   * and we throw exception to prevent executing the query.
```

Review comment: We don't throw an exception here; we set the flag and fail later.
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575: URL: https://github.com/apache/spark/pull/34575#discussion_r755178439

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ##

```
@@ -103,6 +116,135 @@ class FileScanRDD(
     context.killTaskIfInterrupted()
     (currentIterator != null && currentIterator.hasNext) || nextIterator()
   }
+
+  ///
+  // FILE METADATA METHODS //
+  ///
+
+  // whether a metadata column exists and it is a `MetadataAttribute`
+  private lazy val hasMetadataAttribute: Boolean = {
+    metadataStruct.exists {
+      case MetadataAttribute(_) => true
+      case _ => false
+    }
+  }
+
+  // metadata struct unsafe row, will only be updated when the current file is changed
+  @volatile private var metadataStructUnsafeRow: UnsafeRow = _
+  // metadata generic row, will only be updated when the current file is changed
+  @volatile private var metadataStructGenericRow: Row = _
+  // an unsafe joiner to join an unsafe row with the metadata unsafe row
+  lazy private val unsafeRowJoiner =
+    if (hasMetadataAttribute)
+      GenerateUnsafeRowJoiner.create(requiredSchema, Seq(metadataStruct.get).toStructType)
+
+  // Create a off/on heap WritableColumnVector
+  private def createColumnVector(numRows: Int, dataType: DataType): WritableColumnVector = {
+    if (offHeapColumnVectorEnabled) {
+      new OffHeapColumnVector(numRows, dataType)
+    } else {
+      new OnHeapColumnVector(numRows, dataType)
+    }
+  }
+
+  /**
+   * For each partitioned file, metadata columns for each record in the file are exactly same.
+   * Only update metadata columns when `currentFile` is changed.
+   */
+  private def updateMetadataStruct(): Unit = {
+    if (hasMetadataAttribute) {
+      val meta = metadataStruct.get
+      if (currentFile == null) {
+        metadataStructUnsafeRow = new UnsafeRow(1)
+        metadataStructGenericRow = new GenericRow(1)
```

Review comment: seem we can just null out these 2 rows. If `currentFile` is null, we won't output any more records, and these 2 rows are useless.
[GitHub] [spark] cloud-fan commented on pull request #34642: [SPARK-37369][SQL] Avoid redundant ColumnarToRow transition on InMemoryTableScan
cloud-fan commented on pull request #34642: URL: https://github.com/apache/spark/pull/34642#issuecomment-976690695 I'm trying to understand the motivation. Is it because in-memory table can output rows efficiently? Parquet scan can also output rows but we try our best to output columnar batches.
[GitHub] [spark] srowen commented on pull request #34692: [SPARK-11792][FOLLOWUP] Update scaladoc of KnownSizeEstimation
srowen commented on pull request #34692: URL: https://github.com/apache/spark/pull/34692#issuecomment-976711598 Because that JIRA is so old, we wouldn't really treat this as part of that JIRA. I'll just make it a minor docs PR.
[GitHub] [spark] gengliangwang commented on a change in pull request #34596: [SPARK-37326][SQL] Support TimestampNTZ in CSV data source
gengliangwang commented on a change in pull request #34596: URL: https://github.com/apache/spark/pull/34596#discussion_r755249223

## File path: sql/core/src/test/resources/sql-tests/results/timestampNTZ/timestamp.sql.out ##

```
@@ -373,17 +374,19 @@ struct
```

Review comment: (screenshot: https://user-images.githubusercontent.com/1097932/143054054-2cc408d6-0894-4584-9034-4561fe82ab6c.png) I looked it up in the ANSI SQL spec. The string input here can be considered as a timestamp with TZ. I think we should allow casting a timestamp string with TZ to TimestampNTZ.
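For illustration, the cast under discussion, sketched via the Scala API; this reflects the proposed behavior, not necessarily the merged semantics, and assumes the `TIMESTAMP_NTZ` type name available on master at the time:

```
// Proposal under discussion: a timestamp string that carries a zone
// offset would be accepted when cast to TIMESTAMP_NTZ, with the zone
// part handled per the ANSI rules cited above.
spark.sql("SELECT CAST('2021-11-23 10:00:00+08:00' AS TIMESTAMP_NTZ)").show()
```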
[GitHub] [spark] peter-toth commented on pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
peter-toth commented on pull request #34693: URL: https://github.com/apache/spark/pull/34693#issuecomment-976828400 Hmm, failures in `ExpressionsSchemaSuite` look unrelated...
[GitHub] [spark] SparkQA commented on pull request #34691: [SPARK-37447][SQL] Cache LogicalPlan.isStreaming() result in a lazy val
SparkQA commented on pull request #34691: URL: https://github.com/apache/spark/pull/34691#issuecomment-976549263 **[Test build #145544 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145544/testReport)** for PR 34691 at commit [`0c393a5`](https://github.com/apache/spark/commit/0c393a586b2a172c0138eeacf7552d8561157af4).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #34070: [SPARK-36840][SQL] Support DPP if there is no selective predicate on the filtering side
AmplabJenkins commented on pull request #34070: URL: https://github.com/apache/spark/pull/34070#issuecomment-976570564 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145545/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34070: [SPARK-36840][SQL] Support DPP if there is no selective predicate on the filtering side
AmplabJenkins removed a comment on pull request #34070: URL: https://github.com/apache/spark/pull/34070#issuecomment-976570564 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145545/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
AmplabJenkins removed a comment on pull request #34693: URL: https://github.com/apache/spark/pull/34693#issuecomment-976570563 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50020/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34691: [SPARK-37447][SQL] Cache LogicalPlan.isStreaming() result in a lazy val
AmplabJenkins removed a comment on pull request #34691: URL: https://github.com/apache/spark/pull/34691#issuecomment-976570562 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145544/
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575: URL: https://github.com/apache/spark/pull/34575#discussion_r755163320

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ##

```
@@ -194,10 +195,22 @@ case class FileSourceScanExec(
     disableBucketedScan: Boolean = false)
   extends DataSourceScanExec {
 
+  lazy val outputMetadataStruct: Option[AttributeReference] =
+    output.collectFirst { case MetadataAttribute(attr) => attr }
+
   // Note that some vals referring the file-based relation are lazy intentionally
   // so that this plan can be canonicalized on executor side too. See SPARK-23731.
   override lazy val supportsColumnar: Boolean = {
-    relation.fileFormat.supportBatch(relation.sparkSession, schema)
+    // schema without the file metadata column
+    val fileSchema = if (outputMetadataStruct.isEmpty) schema else {
+      StructType.fromAttributes(
```

Review comment: nit: `output.filter(_.exprId != metadataStructCol.get.exprId).toStruct`
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575: URL: https://github.com/apache/spark/pull/34575#discussion_r755167684

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala ##

```
@@ -171,6 +171,28 @@ trait FileFormat {
   def supportFieldName(name: String): Boolean = true
 }
+
+object FileFormat {
+
+  val FILE_PATH = "file_path"
+
+  val FILE_NAME = "file_name"
+
+  val FILE_SIZE = "file_size"
+
+  val FILE_MODIFICATION_TIME = "file_modification_time"
+
+  val METADATA_NAME = "_metadata"
+
+  val METADATA_STRUCT: DataType = new StructType()
+    .add(StructField(FILE_PATH, StringType))
+    .add(StructField(FILE_NAME, StringType))
+    .add(StructField(FILE_SIZE, LongType))
+    .add(StructField(FILE_MODIFICATION_TIME, LongType))
+
+  // supported metadata columns for hadoop fs relation
+  def FILE_METADATA_COLUMNS: AttributeReference = MetadataAttribute(METADATA_NAME, METADATA_STRUCT)
```

Review comment: since it's a `def` now, how about `def createFileMetadataCol`
[GitHub] [spark] SparkQA removed a comment on pull request #34070: [SPARK-36840][SQL] Support DPP if there is no selective predicate on the filtering side
SparkQA removed a comment on pull request #34070: URL: https://github.com/apache/spark/pull/34070#issuecomment-976297392 **[Test build #145545 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145545/testReport)** for PR 34070 at commit [`50484e2`](https://github.com/apache/spark/commit/50484e23c714e06057993757efa20034a261d2bf).
[GitHub] [spark] SparkQA commented on pull request #34070: [SPARK-36840][SQL] Support DPP if there is no selective predicate on the filtering side
SparkQA commented on pull request #34070: URL: https://github.com/apache/spark/pull/34070#issuecomment-976551994 **[Test build #145545 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145545/testReport)** for PR 34070 at commit [`50484e2`](https://github.com/apache/spark/commit/50484e23c714e06057993757efa20034a261d2bf).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] cloud-fan commented on a change in pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
cloud-fan commented on a change in pull request #34668: URL: https://github.com/apache/spark/pull/34668#discussion_r755150850

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala ##

```
@@ -78,9 +79,30 @@ abstract class AbstractSqlParser extends ParserInterface with SQLConfHelper with
     astBuilder.visitQuery(parser.query())
   }
 
+  /** check `has_unclosed_bracketed_comment` to find out the unclosed bracketed comment. */
+  private def singleStatementWithCheck(
+      parser: SqlBaseParser, sqlText: String): SingleStatementContext = {
+    val singleStatementContext = parser.singleStatement()
+    assert(parser.getTokenStream.isInstanceOf[CommonTokenStream])
+
+    val tokenStream = parser.getTokenStream.asInstanceOf[CommonTokenStream]
+    assert(tokenStream.getTokenSource.isInstanceOf[SqlBaseLexer])
+
+    val lexer = tokenStream.getTokenSource.asInstanceOf[SqlBaseLexer]
+    if (lexer.has_unclosed_bracketed_comment) {
+      // The last token is 'EOF' and the penultimate is unclosed bracketed comment
+      val failedToken = tokenStream.get(tokenStream.size() - 2)
+      assert(failedToken.getType() == SqlBaseParser.BRACKETED_COMMENT)
+      val position = Origin(Option(failedToken.getLine), Option(failedToken.getCharPositionInLine))
+      throw QueryParsingErrors.unclosedBracketedCommentError(sqlText, position)
+    }
+
+    singleStatementContext
+  }
+
   /** Creates LogicalPlan for a given SQL string. */
   override def parsePlan(sqlText: String): LogicalPlan = parse(sqlText) { parser =>
-    astBuilder.visitSingleStatement(parser.singleStatement()) match {
+    astBuilder.visitSingleStatement(singleStatementWithCheck(parser, sqlText)) match {
```

Review comment: Can we check the unclosed comment using a listener in the `parse` method below?
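For orientation, a sketch of the input this check guards against; the exact failure comes from `QueryParsingErrors.unclosedBracketedCommentError` quoted above:

```
// Without the lexer flag, an unclosed bracketed comment silently swallows
// the rest of the statement; with this change, parsing fails instead.
spark.sql("SELECT 1 /* this bracketed comment is never closed")
// expected: a parse error pointing at the unclosed comment
```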
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575: URL: https://github.com/apache/spark/pull/34575#discussion_r755164613

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ##

```
@@ -212,7 +225,16 @@ case class FileSourceScanExec(
     relation.fileFormat.vectorTypes(
       requiredSchema = requiredSchema,
       partitionSchema = relation.partitionSchema,
-      relation.sparkSession.sessionState.conf)
+      relation.sparkSession.sessionState.conf).map { vectorTypes =>
+        val metadataVectorClz =
+          if (relation.sparkSession.sessionState.conf.offHeapColumnVectorEnabled) {
+            classOf[OffHeapColumnVector].getName
+          } else {
+            classOf[OnHeapColumnVector].getName
```

Review comment: since we will change to use a constant vector soon, how about we always use `OnHeapColumnVector` for now, to simplify the code?
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575: URL: https://github.com/apache/spark/pull/34575#discussion_r755183339

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ##

```
@@ -103,6 +116,135 @@ class FileScanRDD(
     context.killTaskIfInterrupted()
     (currentIterator != null && currentIterator.hasNext) || nextIterator()
   }
+
+  ///
+  // FILE METADATA METHODS //
+  ///
+
+  // whether a metadata column exists and it is a `MetadataAttribute`
+  private lazy val hasMetadataAttribute: Boolean = {
+    metadataStruct.exists {
+      case MetadataAttribute(_) => true
+      case _ => false
+    }
+  }
+
+  // metadata struct unsafe row, will only be updated when the current file is changed
+  @volatile private var metadataStructUnsafeRow: UnsafeRow = _
+  // metadata generic row, will only be updated when the current file is changed
+  @volatile private var metadataStructGenericRow: Row = _
+  // an unsafe joiner to join an unsafe row with the metadata unsafe row
+  lazy private val unsafeRowJoiner =
+    if (hasMetadataAttribute)
+      GenerateUnsafeRowJoiner.create(requiredSchema, Seq(metadataStruct.get).toStructType)
+
+  // Create a off/on heap WritableColumnVector
+  private def createColumnVector(numRows: Int, dataType: DataType): WritableColumnVector = {
+    if (offHeapColumnVectorEnabled) {
+      new OffHeapColumnVector(numRows, dataType)
+    } else {
+      new OnHeapColumnVector(numRows, dataType)
+    }
+  }
+
+  /**
+   * For each partitioned file, metadata columns for each record in the file are exactly same.
+   * Only update metadata columns when `currentFile` is changed.
+   */
+  private def updateMetadataStruct(): Unit = {
+    if (hasMetadataAttribute) {
+      val meta = metadataStruct.get
+      if (currentFile == null) {
+        metadataStructUnsafeRow = new UnsafeRow(1)
+        metadataStructGenericRow = new GenericRow(1)
+      } else {
+        // make an generic row
+        assert(meta.dataType.isInstanceOf[StructType])
+        metadataStructGenericRow = Row.fromSeq(
```

Review comment: And does `metadataStructGenericRow` need to be a class member? Seems can be a local variable.
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575: URL: https://github.com/apache/spark/pull/34575#discussion_r755188315

## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataColumnsSuite.scala ##

```
@@ -0,0 +1,481 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.io.File
+import java.nio.file.Files
+import java.text.SimpleDateFormat
+
+import org.apache.spark.sql.{AnalysisException, Column, DataFrame, QueryTest, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.test.SharedSparkSession
+import org.apache.spark.sql.types.{IntegerType, LongType, StringType, StructField, StructType}
+
+class FileMetadataColumnsSuite extends QueryTest with SharedSparkSession {
+
+  val data0: String =
+    """
+      |jack,24,12345,uom
+      |""".stripMargin
+
+  val data1: String =
+    """
+      |lily,31,,ucb
+      |""".stripMargin
+
+  val schema: StructType = new StructType()
+    .add(StructField("name", StringType))
+    .add(StructField("age", IntegerType))
+    .add(StructField("id", LongType))
+    .add(StructField("university", StringType))
+
+  val schemaWithNameConflicts: StructType = new StructType()
+    .add(StructField("name", StringType))
+    .add(StructField("age", IntegerType))
+    .add(StructField("_metadata.file_size", LongType))
+    .add(StructField("_metadata.FILE_NAME", StringType))
+
+  private val METADATA_FILE_PATH = "_metadata.file_path"
+
+  private val METADATA_FILE_NAME = "_metadata.file_name"
+
+  private val METADATA_FILE_SIZE = "_metadata.file_size"
+
+  private val METADATA_FILE_MODIFICATION_TIME = "_metadata.file_modification_time"
+
+  /**
+   * Create a CSV file named `fileName` with `data` under `dir` directory.
+   */
+  private def createCSVFile(data: String, dir: File, fileName: String): String = {
+    val dataFile = new File(dir, s"/$fileName")
+    dataFile.getParentFile.mkdirs()
+    val bytes = data.getBytes()
+    Files.write(dataFile.toPath, bytes)
+    dataFile.getCanonicalPath
+  }
+
+  /**
+   * This test wrapper will test for both row-based and column-based file formats (csv and parquet)
+   * 1. read data0 and data1 and write them as testFileFormat: f0 and f1
+   * 2. read both f0 and f1, return the df to the downstream for further testing
+   * 3. construct actual metadata map for both f0 and f1 to the downstream for further testing
+   *
+   * The final df will have data:
+   * jack | 24 | 12345 | uom
+   * lily | 31 | null  | ucb
+   *
+   * The schema of the df will be the `fileSchema` provided to this method
+   *
+   * This test wrapper will provide a `df` and actual metadata map `f0`, `f1`
+   */
+  private def metadataColumnsTest(
+      testName: String, fileSchema: StructType)
+      (f: (DataFrame, Map[String, Any], Map[String, Any]) => Unit): Unit = {
+    Seq("csv", "parquet").foreach { testFileFormat =>
+      test(s"metadata columns ($testFileFormat): " + testName) {
+        withTempDir { dir =>
+          // read data0 as CSV and write f0 as testFileFormat
+          val df0 = spark.read.schema(fileSchema).csv(
+            createCSVFile(data0, dir, "temp/0")
+          )
+          val f0Path = new File(dir, "data/f0").getCanonicalPath
+          df0.coalesce(1).write.format(testFileFormat).save(f0Path)
+
+          // read data1 as CSV and write f1 as testFileFormat
+          val df1 = spark.read.schema(fileSchema).csv(
+            createCSVFile(data1, dir, "temp/1")
+          )
+          val f1Path = new File(dir, "data/f1").getCanonicalPath
+          df1.coalesce(1).write.format(testFileFormat).save(f1Path)
+
+          // read both f0 and f1
+          val df = spark.read.format(testFileFormat).schema(fileSchema)
+            .load(new File(dir, "data").getCanonicalPath + "/*")
+
+          val realF0 = new File(dir, "data/f0").listFiles()
+            .filter(_.getName.endsWith(s".$testFileFormat")).head
+
+          val realF1 = new File(dir, "data/f1").listFiles()
+            .filter(_.getName.endsWith(s".$testFileFormat")).head
+
+          // construct f0 and f1 metadata data
+          val f0Metadata = Map(
```
[GitHub] [spark] cloud-fan commented on a change in pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty
cloud-fan commented on a change in pull request #34504: URL: https://github.com/apache/spark/pull/34504#discussion_r755213627

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ##

```
@@ -1548,6 +1548,31 @@ object PushPredicateThroughNonJoin extends Rule[LogicalPlan] with PredicateHelper
       filter
     }
+    case filter @ Filter(condition, w: Window) if w.partitionSpec.isEmpty =>
+      w.windowExpressions match {
+        case Seq(alias @ Alias(WindowExpression(_: RowNumber, WindowSpecDefinition(Nil, orderSpec,
+            SpecifiedWindowFrame(RowFrame, UnboundedPreceding, CurrentRow))), _)) =>
+          val aliasAttr = alias.toAttribute
+          val limitValue = splitConjunctivePredicates(condition).collectFirst {
+            case LessThanOrEqual(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v
+            case Equality(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v
+            case LessThan(e, IntegerLiteral(v)) if e.semanticEquals(aliasAttr) => v - 1
+          }
+
+          limitValue match {
+            case Some(lv) if lv <= 0 =>
+              LocalRelation(filter.output, data = Seq.empty, isStreaming = filter.isStreaming)
+            case Some(lv)
+                if lv < conf.topKSortFallbackThreshold && w.child.maxRows.forall(_ > lv) =>
+              filter.copy(child =
+                w.copy(child = Limit(Literal(lv), Sort(orderSpec, true, w.child
```

Review comment: One worry is what if other optimizer rules put/move operators between Limit and Sort? Then we can't use `TakeOrderedAndProjectExec`, and this global sort introduces a big overhead.
[GitHub] [spark] peter-toth commented on pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
peter-toth commented on pull request #34693: URL: https://github.com/apache/spark/pull/34693#issuecomment-97672 This change also seems to work with MSSQL's temp table syntax:

```
val withClause = "(SELECT * INTO #TempTable FROM (SELECT * FROM tbl) t)"
val query = "SELECT * FROM #TempTable"
val df = spark.read.format("jdbc")
  .option("url", jdbcUrl)
  .option("withClause", withClause)
  .option("query", query)
  .load()
```
[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-976793578 **[Test build #145550 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145550/testReport)** for PR 34671 at commit [`61fe9fb`](https://github.com/apache/spark/commit/61fe9fb6bc09863aa74ab82ac5d653e2b922ae0a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #34677: [SPARK-37436][PYTHON] Uses Python's standard string formatter for SQL API in pandas API on Spark
AmplabJenkins commented on pull request #34677: URL: https://github.com/apache/spark/pull/34677#issuecomment-976494224 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145547/
[GitHub] [spark] thejdeep commented on pull request #34607: [SPARK-36038][CORE] Speculation metrics summary at stage level
thejdeep commented on pull request #34607: URL: https://github.com/apache/spark/pull/34607#issuecomment-976532846

> > I changed maybeUpdate to update when we encounter a speculative task. The problem with the previous approach was that - if we have a speculative task, then future task end events would all make unnecessary speculation summary writes to DB. What do you think ?
>
> How about changing `speculationStageSummary` in `LiveStage` to `var speculationStageSummary: Option[LiveSpeculationStageSummary]` and then, we can check whether there is any update about speculation summary. If it's `Some`, we can update with `maybeUpdate`.

@sarutak I would rather not, since it would require changing a class `val` to a `var`. I am assuming this is the change you suggested:

```
// AppStatusListener.scala
if (event.taskInfo.speculative) {
  stage.speculationStageSummary = Some(stage.speculationStageSummary.getOrElse(
    new LiveSpeculationStageSummary(event.stageId, event.stageAttemptId)))
  val speculationStageSummary = stage.speculationStageSummary.get
  speculationStageSummary.numActiveTasks -= 1
  speculationStageSummary.numCompletedTasks += completedDelta
  speculationStageSummary.numFailedTasks += failedDelta
  speculationStageSummary.numKilledTasks += killedDelta
}
if (stage.speculationStageSummary.isDefined) {
  maybeUpdate(stage.speculationStageSummary.get, now)
}
```

and

```
// LiveEntity.scala
var speculationStageSummary: Option[LiveSpeculationStageSummary] = None
```

What do you think?
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575: URL: https://github.com/apache/spark/pull/34575#discussion_r755180169

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ##

```
@@ -103,6 +116,135 @@ class FileScanRDD(
     context.killTaskIfInterrupted()
     (currentIterator != null && currentIterator.hasNext) || nextIterator()
   }
+
+  ///
+  // FILE METADATA METHODS //
+  ///
+
+  // whether a metadata column exists and it is a `MetadataAttribute`
+  private lazy val hasMetadataAttribute: Boolean = {
+    metadataStruct.exists {
+      case MetadataAttribute(_) => true
+      case _ => false
+    }
+  }
+
+  // metadata struct unsafe row, will only be updated when the current file is changed
+  @volatile private var metadataStructUnsafeRow: UnsafeRow = _
+  // metadata generic row, will only be updated when the current file is changed
+  @volatile private var metadataStructGenericRow: Row = _
+  // an unsafe joiner to join an unsafe row with the metadata unsafe row
+  lazy private val unsafeRowJoiner =
+    if (hasMetadataAttribute)
+      GenerateUnsafeRowJoiner.create(requiredSchema, Seq(metadataStruct.get).toStructType)
+
+  // Create a off/on heap WritableColumnVector
+  private def createColumnVector(numRows: Int, dataType: DataType): WritableColumnVector = {
+    if (offHeapColumnVectorEnabled) {
+      new OffHeapColumnVector(numRows, dataType)
+    } else {
+      new OnHeapColumnVector(numRows, dataType)
+    }
+  }
+
+  /**
+   * For each partitioned file, metadata columns for each record in the file are exactly same.
+   * Only update metadata columns when `currentFile` is changed.
+   */
+  private def updateMetadataStruct(): Unit = {
+    if (hasMetadataAttribute) {
+      val meta = metadataStruct.get
+      if (currentFile == null) {
+        metadataStructUnsafeRow = new UnsafeRow(1)
+        metadataStructGenericRow = new GenericRow(1)
+      } else {
+        // make an generic row
+        assert(meta.dataType.isInstanceOf[StructType])
+        metadataStructGenericRow = Row.fromSeq(
+          meta.dataType.asInstanceOf[StructType].names.map {
+            case FILE_PATH => UTF8String.fromString(new File(currentFile.filePath).toString)
+            case FILE_NAME => UTF8String.fromString(
+              currentFile.filePath.split("/").last)
+            case FILE_SIZE => currentFile.fileSize
+            case FILE_MODIFICATION_TIME => currentFile.modificationTime
+            case _ => None // be exhaustive, won't happen
+          }
+        )
+
+        // convert the generic row to an unsafe row
+        val unsafeRowConverter = {
+          val converter = UnsafeProjection.create(
+            Array(METADATA_STRUCT))
```

Review comment: seems safer to use `Array(meta.dataType.asInstanceOf[StructType])`
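For orientation, a minimal sketch of the generic-row-to-unsafe-row conversion pattern under review, using Spark-internal APIs with an abbreviated, illustrative metadata schema (not the PR's exact code):

```
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.{LongType, StringType, StructType}
import org.apache.spark.unsafe.types.UTF8String

// Build the projection from the actual metadata attribute's StructType
// (as suggested above), then reuse it once per file. The projection
// reuses an internal buffer, so copy() is needed to retain the row.
val metaSchema = new StructType()
  .add("file_path", StringType)
  .add("file_size", LongType)
val toUnsafe = UnsafeProjection.create(metaSchema)
val unsafeMetaRow = toUnsafe(
  InternalRow(UTF8String.fromString("/data/f0.parquet"), 1024L)).copy()
```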
[GitHub] [spark] nchammas commented on pull request #34655: [SPARK-37380][PYTHON] Miscellaneous Python lint infra cleanup
nchammas commented on pull request #34655: URL: https://github.com/apache/spark/pull/34655#issuecomment-976699676 @HyukjinKwon - Is there anyone else you think should review this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #34684: [SPARK-37442][SQL] InMemoryRelation statistics bug causing broadcast join failures with AQE enabled
cloud-fan commented on pull request #34684: URL: https://github.com/apache/spark/pull/34684#issuecomment-976714972 Why is there no problem with AQE off? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on a change in pull request #34692: [MINOR][DOCS] Update scaladoc of KnownSizeEstimation
srowen commented on a change in pull request #34692:
URL: https://github.com/apache/spark/pull/34692#discussion_r755242633

## File path: core/src/main/scala/org/apache/spark/util/SizeEstimator.scala
@@ -33,10 +33,9 @@ import org.apache.spark.util.collection.OpenHashSet
 
 /**
  * A trait that allows a class to give [[SizeEstimator]] more accurate size estimation.
- * When a class extends it, [[SizeEstimator]] will query the `estimatedSize` first.
- * If `estimatedSize` does not return `None`, [[SizeEstimator]] will use the returned size
- * as the size of the object. Otherwise, [[SizeEstimator]] will do the estimation work.
- * The difference between a [[KnownSizeEstimation]] and
+ * When a class extends it, [[SizeEstimator]] will query the `estimatedSize`, and use
+ * the returned size as the size of the object. Otherwise, [[SizeEstimator]] will do

Review comment:
Can you explain this more? I'm not clear what the difference is in these statements

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
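For context, the shape of the trait under discussion, sketched from the scaladoc quoted above: `estimatedSize` returns a plain `Long`, not an `Option`, which is what made the old "does not return `None`" wording confusing.

```
// Sketch of the trait as described by the scaladoc above; estimatedSize
// returns Long, so it can never literally "return None".
trait KnownSizeEstimation {
  def estimatedSize: Long
}
```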
[GitHub] [spark] cloud-fan commented on a change in pull request #34596: [SPARK-37326][SQL] Support TimestampNTZ in CSV data source
cloud-fan commented on a change in pull request #34596:
URL: https://github.com/apache/spark/pull/34596#discussion_r755252311

## File path: sql/core/src/test/resources/sql-tests/results/timestampNTZ/timestamp.sql.out
@@ -373,17 +374,19 @@ struct
[GitHub] [spark] SparkQA removed a comment on pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
SparkQA removed a comment on pull request #34693: URL: https://github.com/apache/spark/pull/34693#issuecomment-976453455 **[Test build #145548 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145548/testReport)** for PR 34693 at commit [`e2c9577`](https://github.com/apache/spark/commit/e2c957718196f8269c56fb4e3c6a8b4d1eed3074). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
SparkQA commented on pull request #34693: URL: https://github.com/apache/spark/pull/34693#issuecomment-976827223 **[Test build #145548 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145548/testReport)** for PR 34693 at commit [`e2c9577`](https://github.com/apache/spark/commit/e2c957718196f8269c56fb4e3c6a8b4d1eed3074). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34611: [SPARK-35867][SQL] Enable vectorized read for VectorizedPlainValuesReader.readBooleans
AmplabJenkins removed a comment on pull request #34611: URL: https://github.com/apache/spark/pull/34611#issuecomment-976490264 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145540/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34677: [SPARK-37436][PYTHON] Uses Python's standard string formatter for SQL API in pandas API on Spark
SparkQA commented on pull request #34677: URL: https://github.com/apache/spark/pull/34677#issuecomment-976491507 **[Test build #145547 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145547/testReport)** for PR 34677 at commit [`0b67651`](https://github.com/apache/spark/commit/0b6765150798799a418d39209b4e5f6d4a16276e). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class SQLStringFormatter(string.Formatter):` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] thejdeep commented on a change in pull request #34607: [SPARK-36038][CORE] Speculation metrics summary at stage level
thejdeep commented on a change in pull request #34607:
URL: https://github.com/apache/spark/pull/34607#discussion_r755122578

## File path: core/src/main/scala/org/apache/spark/status/storeTypes.scala
@@ -399,6 +399,20 @@ private[spark] class ExecutorStageSummaryWrapper(
 }
 
+private[spark] class SpeculationStageSummaryWrapper(
+    val stageId: Int,
+    val stageAttemptId: Int,
+    val info: SpeculationStageSummary) {
+
+  @JsonIgnore @KVIndex
+  private val _id: Array[Int] = Array(stageId, stageAttemptId)
+
+  @JsonIgnore @KVIndex("stage")
+  private def stage: Array[Int] = Array(stageId, stageAttemptId)
+
+  private[this] val id: Array[Int] = _id

Review comment:
The KVStore API requires an `id`, but this was implemented just for convenience over class members and API constructs. I can remove `_id`.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
SparkQA commented on pull request #34668: URL: https://github.com/apache/spark/pull/34668#issuecomment-976553555 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50021/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575:
URL: https://github.com/apache/spark/pull/34575#discussion_r755161598

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
@@ -194,10 +195,22 @@ case class FileSourceScanExec(
     disableBucketedScan: Boolean = false)
   extends DataSourceScanExec {
 
+  lazy val outputMetadataStruct: Option[AttributeReference] =

Review comment:
nit: it's probably better to use a noun here, how about `metadataStructCol`?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sarutak commented on pull request #34607: [SPARK-36038][CORE] Speculation metrics summary at stage level
sarutak commented on pull request #34607: URL: https://github.com/apache/spark/pull/34607#issuecomment-976610243 > since it would require changing a class val to a var. What's the problem? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575:
URL: https://github.com/apache/spark/pull/34575#discussion_r755175121

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
@@ -103,6 +116,135 @@ class FileScanRDD(
       context.killTaskIfInterrupted()
       (currentIterator != null && currentIterator.hasNext) || nextIterator()
     }
+
+    ///////////////////////////
+    // FILE METADATA METHODS //
+    ///////////////////////////
+
+    // whether a metadata column exists and it is a `MetadataAttribute`
+    private lazy val hasMetadataAttribute: Boolean = {
+      metadataStruct.exists {
+        case MetadataAttribute(_) => true
+        case _ => false
+      }
+    }
+
+    // metadata struct unsafe row, will only be updated when the current file is changed
+    @volatile private var metadataStructUnsafeRow: UnsafeRow = _
+    // metadata generic row, will only be updated when the current file is changed
+    @volatile private var metadataStructGenericRow: Row = _
+    // an unsafe joiner to join an unsafe row with the metadata unsafe row
+    lazy private val unsafeRowJoiner =
+      if (hasMetadataAttribute)
+        GenerateUnsafeRowJoiner.create(requiredSchema, Seq(metadataStruct.get).toStructType)
+
+    // Create a off/on heap WritableColumnVector
+    private def createColumnVector(numRows: Int, dataType: DataType): WritableColumnVector = {
+      if (offHeapColumnVectorEnabled) {
+        new OffHeapColumnVector(numRows, dataType)
+      } else {
+        new OnHeapColumnVector(numRows, dataType)
+      }
+    }
+
+    /**
+     * For each partitioned file, metadata columns for each record in the file are exactly same.
+     * Only update metadata columns when `currentFile` is changed.
+     */
+    private def updateMetadataStruct(): Unit = {
+      if (hasMetadataAttribute) {
+        val meta = metadataStruct.get
+        if (currentFile == null) {
+          metadataStructUnsafeRow = new UnsafeRow(1)
+          metadataStructGenericRow = new GenericRow(1)
+        } else {
+          // make an generic row
+          assert(meta.dataType.isInstanceOf[StructType])
+          metadataStructGenericRow = Row.fromSeq(
+            meta.dataType.asInstanceOf[StructType].names.map {
+              case FILE_PATH => UTF8String.fromString(new File(currentFile.filePath).toString)

Review comment:
Why can't we use `filePath` directly? `new File` looks buggy as it won't work for `s3://` or other file systems.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
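To make the concern concrete: `java.io.File` normalizes its argument as a local filesystem path and collapses repeated separators, so a URI-style string loses the `//` after the scheme. A small self-contained illustration (the path is hypothetical):

```
import java.io.File

object FilePathNormalization {
  def main(args: Array[String]): Unit = {
    // File's constructor collapses the double slash, mangling the scheme
    // separator of a URI-style path.
    println(new File("s3://bucket/dir/data.parquet").toString)
    // On a Unix-like JVM this prints: s3:/bucket/dir/data.parquet
  }
}
```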
[GitHub] [spark] cloud-fan commented on a change in pull request #34504: [SPARK-37226][SQL] Filter push down through window if partitionSpec isEmpty
cloud-fan commented on a change in pull request #34504:
URL: https://github.com/apache/spark/pull/34504#discussion_r755210674

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
@@ -1548,6 +1548,31 @@ object PushPredicateThroughNonJoin extends Rule[LogicalPlan] with PredicateHelpe
         filter
     }
 
+    case filter @ Filter(condition, w: Window) if w.partitionSpec.isEmpty =>
+      w.windowExpressions match {
+        case Seq(alias @ Alias(WindowExpression(_: RowNumber, WindowSpecDefinition(Nil, orderSpec,
+            SpecifiedWindowFrame(RowFrame, UnboundedPreceding, CurrentRow))), _)) =>

Review comment:
I think this optimization should also apply when there is more than one window expression. The algorithm should simply be:
```
val limits = w.windowExpressions.map {
  case row_number_func => calculate the limit
  case _ => none
}
val limit = limits.min
add limit + sort
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
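A runnable toy version of the min-of-limits step in the sketch above; the `Some(n)`/`None` values stand in for per-window-expression bounds derived from the filter condition (hypothetical helper output, not Spark's actual API):

```
object MinWindowLimit {
  def main(args: Array[String]): Unit = {
    // Hypothetical per-window-expression limits: Some(n) when the filter
    // bounds row_number() by n, None when an expression gives no bound.
    val limits: Seq[Option[Int]] = Seq(Some(5), Some(3))

    // Only rewrite when every window expression is bounded; the smallest
    // bound becomes the pushed-down limit.
    if (limits.nonEmpty && limits.forall(_.isDefined)) {
      val limit = limits.flatten.min
      println(s"rewrite as Limit($limit) + Sort") // prints: rewrite as Limit(3) + Sort
    }
  }
}
```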
[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-976763063 **[Test build #145550 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145550/testReport)** for PR 34671 at commit [`61fe9fb`](https://github.com/apache/spark/commit/61fe9fb6bc09863aa74ab82ac5d653e2b922ae0a). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-976826549 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50022/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #34677: [SPARK-37436][PYTHON] Uses Python's standard string formatter for SQL API in pandas API on Spark
SparkQA removed a comment on pull request #34677: URL: https://github.com/apache/spark/pull/34677#issuecomment-976356000 **[Test build #145547 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145547/testReport)** for PR 34677 at commit [`0b67651`](https://github.com/apache/spark/commit/0b6765150798799a418d39209b4e5f6d4a16276e). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
SparkQA commented on pull request #34668: URL: https://github.com/apache/spark/pull/34668#issuecomment-976492795 **[Test build #145549 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145549/testReport)** for PR 34668 at commit [`24d3046`](https://github.com/apache/spark/commit/24d3046b5c3c58c2e2d2edf7a9f6fa095ba75682). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
AmplabJenkins commented on pull request #34693: URL: https://github.com/apache/spark/pull/34693#issuecomment-976570563 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50020/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34691: [SPARK-37447][SQL] Cache LogicalPlan.isStreaming() result in a lazy val
AmplabJenkins commented on pull request #34691: URL: https://github.com/apache/spark/pull/34691#issuecomment-976570562 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145544/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
SparkQA commented on pull request #34693: URL: https://github.com/apache/spark/pull/34693#issuecomment-976568407 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50020/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
cloud-fan commented on a change in pull request #34668:
URL: https://github.com/apache/spark/pull/34668#discussion_r755144886

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala
@@ -78,9 +79,30 @@ abstract class AbstractSqlParser extends ParserInterface with SQLConfHelper with
     astBuilder.visitQuery(parser.query())
   }
 
+  /** check `has_unclosed_bracketed_comment` to find out the unclosed bracketed comment. */
+  private def singleStatementWithCheck(
+      parser: SqlBaseParser, sqlText: String): SingleStatementContext = {
+    val singleStatementContext = parser.singleStatement()
+    assert(parser.getTokenStream.isInstanceOf[CommonTokenStream])
+
+    val tokenStream = parser.getTokenStream.asInstanceOf[CommonTokenStream]
+    assert(tokenStream.getTokenSource.isInstanceOf[SqlBaseLexer])
+
+    val lexer = tokenStream.getTokenSource.asInstanceOf[SqlBaseLexer]
+    if (lexer.has_unclosed_bracketed_comment) {
+      // The last token is 'EOF' and the penultimate is unclosed bracketed comment
+      val failedToken = tokenStream.get(tokenStream.size() - 2)
+      assert(failedToken.getType() == SqlBaseParser.BRACKETED_COMMENT)
+      val position = Origin(Option(failedToken.getLine), Option(failedToken.getCharPositionInLine))
+      throw QueryParsingErrors.unclosedBracketedCommentError(sqlText, position)
+    }
+
+    singleStatementContext
+  }
+
   /** Creates LogicalPlan for a given SQL string. */
   override def parsePlan(sqlText: String): LogicalPlan = parse(sqlText) { parser =>
-    astBuilder.visitSingleStatement(parser.singleStatement()) match {
+    astBuilder.visitSingleStatement(singleStatementWithCheck(parser, sqlText)) match {

Review comment:
does this work for `parseExpression` and others?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
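For context, the input this check guards against is a bracketed comment that is opened but never closed: the lexer reaches EOF while still inside the comment, so the penultimate token is the unterminated comment itself. A hypothetical example input:

```
// Parsing this should now raise unclosedBracketedCommentError with the
// comment's position instead of a generic parse failure.
val unclosedSql = "SELECT 1 /* this bracketed comment never ends"
```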
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575:
URL: https://github.com/apache/spark/pull/34575#discussion_r755169306

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
@@ -57,11 +66,15 @@ case class PartitionedFile(
 class FileScanRDD(
     @transient private val sparkSession: SparkSession,
     readFunction: (PartitionedFile) => Iterator[InternalRow],
-    @transient val filePartitions: Seq[FilePartition])
+    @transient val filePartitions: Seq[FilePartition],
+    val requiredSchema: StructType = StructType(Seq.empty),

Review comment:
nit: `dataSchema` is probably a better name in this context. We need to join the data rows with the metadata col w.r.t. this schema.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575:
URL: https://github.com/apache/spark/pull/34575#discussion_r755169884

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
@@ -57,11 +66,15 @@ case class PartitionedFile(
 class FileScanRDD(
     @transient private val sparkSession: SparkSession,
     readFunction: (PartitionedFile) => Iterator[InternalRow],
-    @transient val filePartitions: Seq[FilePartition])
+    @transient val filePartitions: Seq[FilePartition],
+    val requiredSchema: StructType = StructType(Seq.empty),
+    val metadataStruct: Option[AttributeReference] = None)

Review comment:
nit: `metadataStructCol`

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575:
URL: https://github.com/apache/spark/pull/34575#discussion_r755176293

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
@@ -103,6 +116,135 @@ class FileScanRDD(
       context.killTaskIfInterrupted()
       (currentIterator != null && currentIterator.hasNext) || nextIterator()
     }
+
+    ///////////////////////////
+    // FILE METADATA METHODS //
+    ///////////////////////////
+
+    // whether a metadata column exists and it is a `MetadataAttribute`
+    private lazy val hasMetadataAttribute: Boolean = {
+      metadataStruct.exists {
+        case MetadataAttribute(_) => true
+        case _ => false
+      }
+    }
+
+    // metadata struct unsafe row, will only be updated when the current file is changed
+    @volatile private var metadataStructUnsafeRow: UnsafeRow = _
+    // metadata generic row, will only be updated when the current file is changed
+    @volatile private var metadataStructGenericRow: Row = _
+    // an unsafe joiner to join an unsafe row with the metadata unsafe row
+    lazy private val unsafeRowJoiner =
+      if (hasMetadataAttribute)
+        GenerateUnsafeRowJoiner.create(requiredSchema, Seq(metadataStruct.get).toStructType)
+
+    // Create a off/on heap WritableColumnVector
+    private def createColumnVector(numRows: Int, dataType: DataType): WritableColumnVector = {
+      if (offHeapColumnVectorEnabled) {
+        new OffHeapColumnVector(numRows, dataType)
+      } else {
+        new OnHeapColumnVector(numRows, dataType)
+      }
+    }
+
+    /**
+     * For each partitioned file, metadata columns for each record in the file are exactly same.
+     * Only update metadata columns when `currentFile` is changed.
+     */
+    private def updateMetadataStruct(): Unit = {
+      if (hasMetadataAttribute) {
+        val meta = metadataStruct.get
+        if (currentFile == null) {
+          metadataStructUnsafeRow = new UnsafeRow(1)
+          metadataStructGenericRow = new GenericRow(1)
+        } else {
+          // make an generic row
+          assert(meta.dataType.isInstanceOf[StructType])
+          metadataStructGenericRow = Row.fromSeq(
+            meta.dataType.asInstanceOf[StructType].names.map {
+              case FILE_PATH => UTF8String.fromString(new File(currentFile.filePath).toString)
+              case FILE_NAME => UTF8String.fromString(
+                currentFile.filePath.split("/").last)

Review comment:
path separator is not always `/`. Let's use `java.io.File.separator`

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575:
URL: https://github.com/apache/spark/pull/34575#discussion_r755181619

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
@@ -103,6 +116,135 @@ class FileScanRDD(
       context.killTaskIfInterrupted()
       (currentIterator != null && currentIterator.hasNext) || nextIterator()
     }
+
+    ///////////////////////////
+    // FILE METADATA METHODS //
+    ///////////////////////////
+
+    // whether a metadata column exists and it is a `MetadataAttribute`
+    private lazy val hasMetadataAttribute: Boolean = {
+      metadataStruct.exists {
+        case MetadataAttribute(_) => true
+        case _ => false
+      }
+    }
+
+    // metadata struct unsafe row, will only be updated when the current file is changed
+    @volatile private var metadataStructUnsafeRow: UnsafeRow = _
+    // metadata generic row, will only be updated when the current file is changed
+    @volatile private var metadataStructGenericRow: Row = _
+    // an unsafe joiner to join an unsafe row with the metadata unsafe row
+    lazy private val unsafeRowJoiner =
+      if (hasMetadataAttribute)
+        GenerateUnsafeRowJoiner.create(requiredSchema, Seq(metadataStruct.get).toStructType)
+
+    // Create a off/on heap WritableColumnVector
+    private def createColumnVector(numRows: Int, dataType: DataType): WritableColumnVector = {
+      if (offHeapColumnVectorEnabled) {
+        new OffHeapColumnVector(numRows, dataType)
+      } else {
+        new OnHeapColumnVector(numRows, dataType)
+      }
+    }
+
+    /**
+     * For each partitioned file, metadata columns for each record in the file are exactly same.
+     * Only update metadata columns when `currentFile` is changed.
+     */
+    private def updateMetadataStruct(): Unit = {
+      if (hasMetadataAttribute) {
+        val meta = metadataStruct.get
+        if (currentFile == null) {
+          metadataStructUnsafeRow = new UnsafeRow(1)
+          metadataStructGenericRow = new GenericRow(1)
+        } else {
+          // make an generic row
+          assert(meta.dataType.isInstanceOf[StructType])
+          metadataStructGenericRow = Row.fromSeq(

Review comment:
Why do we create Row here and then use `CatalystTypeConverters` later? Let's just use `InternalRow.fromSeq` here.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
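A minimal sketch of the suggested direction: build the metadata row in its internal representation with `InternalRow.fromSeq`, skipping the external `Row` plus `CatalystTypeConverters` round trip. The field values are hypothetical:

```
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical metadata values in internal form: strings as UTF8String,
// size and modification time as Long.
val metadataInternalRow: InternalRow = InternalRow.fromSeq(Seq(
  UTF8String.fromString("s3://bucket/dir/data.parquet"), // FILE_PATH
  UTF8String.fromString("data.parquet"),                 // FILE_NAME
  1024L,                                                 // FILE_SIZE
  1637700000000L                                         // FILE_MODIFICATION_TIME
))
```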
[GitHub] [spark] cloud-fan commented on a change in pull request #34575: [SPARK-37273][SQL] Support hidden file metadata columns in Spark SQL
cloud-fan commented on a change in pull request #34575:
URL: https://github.com/apache/spark/pull/34575#discussion_r755185991

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
@@ -212,7 +212,9 @@ object FileSourceStrategy extends Strategy with PredicateHelper with Logging {
       val outputSchema = readDataColumns.toStructType
       logInfo(s"Output Data Schema: ${outputSchema.simpleString(5)}")
 
-      val outputAttributes = readDataColumns ++ partitionColumns
+      // outputAttributes should also include referenced metadata columns at the every end
+      val outputAttributes = readDataColumns ++ partitionColumns ++
+        plan.references.collect { case MetadataAttribute(attr) => attr }

Review comment:
shall we use `requiredAttributes` instead of `plan.references`?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
SparkQA commented on pull request #34668: URL: https://github.com/apache/spark/pull/34668#issuecomment-976633320 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50021/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
AmplabJenkins commented on pull request #34668: URL: https://github.com/apache/spark/pull/34668#issuecomment-97983 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50021/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on a change in pull request #34689: [SPARK-37445][BUILD] Upgrade hadoop profile to hadoop-3.3 since we support hadoop-3.3 as default now
srowen commented on a change in pull request #34689:
URL: https://github.com/apache/spark/pull/34689#discussion_r755239983

## File path: hadoop-cloud/pom.xml
@@ -201,7 +201,7 @@
       enables store-specific committers.
     -->
-      <id>hadoop-3.2</id>
+      <id>hadoop-3.3</id>

Review comment:
Same comment as hive-2.3, but here it may be better to change the name to make sure callers are clear it's not Hadoop 3.2. How about avoiding this going forward with `hadoop-3`? I doubt we're going to publish builds separately for two different Hadoop 3 versions in the future.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34634: [SPARK-37357][SQL] Create skew partition specs should respect min partition size
cloud-fan commented on a change in pull request #34634:
URL: https://github.com/apache/spark/pull/34634#discussion_r755239272

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsUtil.scala
@@ -316,21 +316,25 @@ object ShufflePartitionsUtil extends Logging {
    * start of a partition.
    */
   // Visible for testing
-  private[sql] def splitSizeListByTargetSize(sizes: Seq[Long], targetSize: Long): Array[Int] = {
+  private[sql] def splitSizeListByTargetSize(
+      sizes: Seq[Long],
+      targetSize: Long,
+      minPartitionSize: Long): Array[Int] = {
     val partitionStartIndices = ArrayBuffer[Int]()
     partitionStartIndices += 0
     var i = 0
     var currentPartitionSize = 0L
     var lastPartitionSize = -1L
+    val minSize =
+      Math.min(targetSize, Math.max(minPartitionSize, targetSize * SMALL_PARTITION_FACTOR))
     def tryMergePartitions() = {
       // When we are going to start a new partition, it's possible that the current partition or
       // the previous partition is very small and it's better to merge the current partition into
       // the previous partition.
       val shouldMergePartitions = lastPartitionSize > -1 &&
         ((currentPartitionSize + lastPartitionSize) < targetSize * MERGED_PARTITION_FACTOR ||

Review comment:
I see a conflict here:
1. we want to make each partition smaller than `targetSize * MERGED_PARTITION_FACTOR`
2. we want to make each partition larger than `targetSize * SMALL_PARTITION_FACTOR`

Seems like we prioritize 2 over 1, but in some cases `SMALL_PARTITION_FACTOR` needs to be smaller. How about we make `SMALL_PARTITION_FACTOR` a parameter of the function? Then we can pass different values from the skew join rule and the rebalance rule, e.g. the value can be `0.1` in the rebalance rule, or we could even make it configurable.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
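A hedged sketch of the parameterization suggested above: `smallPartitionFactor` becomes a caller-supplied argument so the skew join rule and the rebalance rule can pass different values (for example `0.1` from the rebalance rule). Only the signature and the `minSize` computation are shown; the merging logic is elided:

```
// Sketch only; the body elides the unchanged merging logic from the diff above.
private[sql] def splitSizeListByTargetSize(
    sizes: Seq[Long],
    targetSize: Long,
    minPartitionSize: Long,
    smallPartitionFactor: Double): Array[Int] = {
  val minSize =
    Math.min(targetSize, Math.max(minPartitionSize, targetSize * smallPartitionFactor))
  ??? // merging logic unchanged
}
```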
[GitHub] [spark] peter-toth edited a comment on pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
peter-toth edited a comment on pull request #34693:
URL: https://github.com/apache/spark/pull/34693#issuecomment-97672

This change also seems to work with MSSQL's temp table syntax:
```
val withClause = "(SELECT * INTO #TempTable FROM (SELECT * FROM tbl WHERE x > 10) t)"
val query = "SELECT * FROM #TempTable"

val df = spark.read.format("jdbc")
  .option("url", jdbcUrl)
  .option("withClause", withClause)
  .option("query", query)
  .load()
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] tgravescs commented on pull request #34622: [SPARK-37340][UI] Display StageIds in Operators for SQL UI
tgravescs commented on pull request #34622:
URL: https://github.com/apache/spark/pull/34622#issuecomment-976749417

Yes, it would be nice to have the actual stageIds in the UI. I'll need to look closer at the logic though, which likely won't be till next week.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA removed a comment on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-976763063 **[Test build #145550 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145550/testReport)** for PR 34671 at commit [`61fe9fb`](https://github.com/apache/spark/commit/61fe9fb6bc09863aa74ab82ac5d653e2b922ae0a). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] tdg5 commented on a change in pull request #29024: [SPARK-32001][SQL]Create JDBC authentication provider developer API
tdg5 commented on a change in pull request #29024:
URL: https://github.com/apache/spark/pull/29024#discussion_r755337180

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/connection/ConnectionProvider.scala
@@ -18,60 +18,45 @@ package org.apache.spark.sql.execution.datasources.jdbc.connection
 
 import java.sql.{Connection, Driver}
-import java.util.Properties
+import java.util.ServiceLoader
 
-import org.apache.spark.internal.Logging
-import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions
-
-/**
- * Connection provider which opens connection toward various databases (database specific instance
- * needed). If kerberos authentication required then it's the provider's responsibility to set all
- * the parameters.
- */
-private[jdbc] trait ConnectionProvider {
-  /**
-   * Additional properties for data connection (Data source property takes precedence).
-   */
-  def getAdditionalProperties(): Properties = new Properties()
+import scala.collection.mutable
 
-  /**
-   * Opens connection toward the database.
-   */
-  def getConnection(): Connection
-}
+import org.apache.spark.internal.Logging
+import org.apache.spark.security.SecurityConfigurationLock
+import org.apache.spark.sql.jdbc.JdbcConnectionProvider
+import org.apache.spark.util.Utils
 
 private[jdbc] object ConnectionProvider extends Logging {
-  def create(driver: Driver, options: JDBCOptions): ConnectionProvider = {
-    if (options.keytab == null || options.principal == null) {
-      logDebug("No authentication configuration found, using basic connection provider")
-      new BasicConnectionProvider(driver, options)
-    } else {
-      logDebug("Authentication configuration found, using database specific connection provider")
-      options.driverClass match {
-        case PostgresConnectionProvider.driverClass =>
-          logDebug("Postgres connection provider found")
-          new PostgresConnectionProvider(driver, options)
-
-        case MariaDBConnectionProvider.driverClass =>
-          logDebug("MariaDB connection provider found")
-          new MariaDBConnectionProvider(driver, options)
-
-        case DB2ConnectionProvider.driverClass =>
-          logDebug("DB2 connection provider found")
-          new DB2ConnectionProvider(driver, options)
-
-        case MSSQLConnectionProvider.driverClass =>
-          logDebug("MS SQL connection provider found")
-          new MSSQLConnectionProvider(driver, options)
-
-        case OracleConnectionProvider.driverClass =>
-          logDebug("Oracle connection provider found")
-          new OracleConnectionProvider(driver, options)
-
-        case _ =>
-          throw new IllegalArgumentException(s"Driver ${options.driverClass} does not support " +
-            "Kerberos authentication")
+  private val providers = loadProviders()
+
+  def loadProviders(): Seq[JdbcConnectionProvider] = {
+    val loader = ServiceLoader.load(classOf[JdbcConnectionProvider],
+      Utils.getContextOrSparkClassLoader)
+    val providers = mutable.ArrayBuffer[JdbcConnectionProvider]()
+
+    val iterator = loader.iterator
+    while (iterator.hasNext) {
+      try {
+        val provider = iterator.next
+        logDebug(s"Loaded built in provider: $provider")
+        providers += provider
+      } catch {
+        case t: Throwable =>
+          logError(s"Failed to load built in provider.", t)
       }
     }
+    // Seems duplicate but it's needed for Scala 2.13
+    providers.toSeq
+  }
+
+  def create(driver: Driver, options: Map[String, String]): Connection = {
+    val filteredProviders = providers.filter(_.canHandle(driver, options))
+    require(filteredProviders.size == 1,
+      "JDBC connection initiated but not exactly one connection provider found which can handle " +
+        s"it. Found active providers: ${filteredProviders.mkString(", ")}")
+    SecurityConfigurationLock.synchronized {

Review comment:
I don't know anything about the underlying problem, but it had crossed my mind that allowing the synchronization to be optional could be one path forward. @gaborgsomogyi may I trouble you for some references related to

> The main problem is that JVM has a single security context. In order to make a connection one MUST modify the security context, otherwise the JDBC connector is not able to provide the security credentials (TGT, keytab or whatever)

so I can better familiarize myself with the problem that you are describing? Beyond this particular issue, what you've shared suggests that the concurrency utilized by my app could be causing us to crosswire data, which would be a major problem. I guess I'd also ask: is there more to it than you described? It sounds like I should either have some cross-wired data, or if that's not the case then there is some missing piece of the puzzle that means the synchronization is not always required. Thanks in advance!
--
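For readers following the ServiceLoader discussion in this thread: providers are discovered purely through classpath registration, so a third party can plug in its own. A hedged sketch (the class name and option key are hypothetical; the two abstract methods follow the diff above), which would also need a `META-INF/services/org.apache.spark.sql.jdbc.JdbcConnectionProvider` file naming the class:

```
import java.sql.{Connection, Driver}
import java.util.Properties

import org.apache.spark.sql.jdbc.JdbcConnectionProvider

// Hypothetical provider: canHandle decides whether this provider owns the
// connection; ConnectionProvider.create requires exactly one provider to
// return true for a given driver/options pair.
class MyKerberosConnectionProvider extends JdbcConnectionProvider {
  override def canHandle(driver: Driver, options: Map[String, String]): Boolean =
    options.get("connectionProvider").contains("my-kerberos")

  override def getConnection(driver: Driver, options: Map[String, String]): Connection = {
    val props = new Properties()
    options.foreach { case (k, v) => props.setProperty(k, v) }
    // Any security-context changes would happen here, guarded by the
    // SecurityConfigurationLock synchronization discussed above.
    driver.connect(options("url"), props)
  }
}
```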
[GitHub] [spark] ChenMichael edited a comment on pull request #34684: [SPARK-37442][SQL] InMemoryRelation statistics bug causing broadcast join failures with AQE enabled
ChenMichael edited a comment on pull request #34684:
URL: https://github.com/apache/spark/pull/34684#issuecomment-976849576

For this problem to manifest, join planning has to happen between the time an InMemoryRelation is converted to an RDD and the time the job executing that RDD completes. In AQE, since it repeatedly replans, this can happen when the InMemoryRelations are on different levels. With AQE off, join planning happens only once, so there is no scenario where one part of the query materializes the InMemoryRelation and join planning then happens on another part of the query with inaccurate stats. I guess this could happen with AQE off if there are concurrent jobs sharing the same InMemoryRelation, though.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34668: [SPARK-37389][SQL] Check unclosed bracketed comments
SparkQA commented on pull request #34668: URL: https://github.com/apache/spark/pull/34668#issuecomment-976912054 **[Test build #145549 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145549/testReport)** for PR 34668 at commit [`24d3046`](https://github.com/apache/spark/commit/24d3046b5c3c58c2e2d2edf7a9f6fa095ba75682). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
AmplabJenkins commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-976940851 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50022/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on a change in pull request #34596: [SPARK-37326][SQL] Support TimestampNTZ in CSV data source
sadikovi commented on a change in pull request #34596:
URL: https://github.com/apache/spark/pull/34596#discussion_r755539240

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
@@ -442,17 +442,22 @@ object DateTimeUtils {
 
   /**
    * Trims and parses a given UTF8 string to a corresponding [[Long]] value which representing the
-   * number of microseconds since the epoch. The result is independent of time zones,
-   * which means that zone ID in the input string will be ignored.
+   * number of microseconds since the epoch. The result will be independent of time zones.
+   *
+   * If the input string contains a component associated with time zone, the method will return
+   * `None` if `ignoreTimeZone` is set to `false`. If `ignoreTimeZone` is set to `true`, the method
+   * will simply discard the time zone component. Enable the check to detect situations like parsing
+   * a timestamp with time zone as TimestampNTZType.
+   *
    * The return type is [[Option]] in order to distinguish between 0L and null. Please
    * refer to `parseTimestampString` for the allowed formats.
    */
-  def stringToTimestampWithoutTimeZone(s: UTF8String): Option[Long] = {
+  def stringToTimestampWithoutTimeZone(s: UTF8String, ignoreTimeZone: Boolean): Option[Long] = {

Review comment:
I understand, but we don't fail if `ignoreTimeZone` is set to `false`; we return an `Option` instead. I can update it in `TimestampFormatter`, but `failOnError` in `stringToTimestampWithoutTimeZone` might be a bit misleading. Let me know what you think. Thanks.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34593: [SPARK-37324][SQL] Adds support for decimal rounding mode up, down, half_down
SparkQA commented on pull request #34593: URL: https://github.com/apache/spark/pull/34593#issuecomment-977249138 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50023/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34593: [SPARK-37324][SQL] Adds support for decimal rounding mode up, down, half_down
SparkQA commented on pull request #34593: URL: https://github.com/apache/spark/pull/34593#issuecomment-977252277 **[Test build #145552 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145552/testReport)** for PR 34593 at commit [`ed83e8b`](https://github.com/apache/spark/commit/ed83e8b10196c97b1f04d614af9233fd8579b148). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ChenMichael edited a comment on pull request #34684: [SPARK-37442][SQL] InMemoryRelation statistics bug causing broadcast join failures with AQE enabled
ChenMichael edited a comment on pull request #34684:
URL: https://github.com/apache/spark/pull/34684#issuecomment-976849576

For this problem to manifest, join planning has to happen between the time an InMemoryRelation is converted to an RDD and the time the job executing that RDD completes. In AQE, since it repeatedly replans, this can happen when the InMemoryRelations are on different levels. I guess this could happen with AQE off if there are concurrent jobs sharing the same InMemoryRelation as well.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ChenMichael edited a comment on pull request #34684: [SPARK-37442][SQL] InMemoryRelation statistics bug causing broadcast join failures with AQE enabled
ChenMichael edited a comment on pull request #34684:
URL: https://github.com/apache/spark/pull/34684#issuecomment-976849576

For this problem to manifest, join planning has to happen between the time an InMemoryRelation is converted to an RDD and the time the job executing that RDD completes. In AQE, since it repeatedly replans, this can happen when the InMemoryRelations are on different levels. I guess this could happen with AQE off if there are concurrent jobs sharing the same InMemoryRelation as well.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
SparkQA commented on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-976884853 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50022/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on a change in pull request #34596: [SPARK-37326][SQL] Support TimestampNTZ in CSV data source
sadikovi commented on a change in pull request #34596:
URL: https://github.com/apache/spark/pull/34596#discussion_r755538280

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
@@ -164,6 +164,10 @@ class CSVOptions(
       s"${DateFormatter.defaultPattern}'T'HH:mm:ss[.SSS][XXX]"
     })
 
+  val timestampNTZFormatInRead: Option[String] = parameters.get("timestampNTZFormat")

Review comment:
How do you mean? Actually, it is the same config, but reads require an `Option[String]` and writes require a `String` (similar to `timestampFormatInRead` and `timestampFormatInWrite`). I can update it to keep only one property, `timestampNTZFormat`, instead. Would that work?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
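A hedged sketch of the single-option, two-accessor pattern being discussed, mirroring how `timestampFormatInRead`/`timestampFormatInWrite` work; the write-side default shown is an assumption for illustration:

```
// Reads keep Option[String] so the parser can fall back to default patterns;
// writes need a concrete pattern string. The write default below is illustrative.
val timestampNTZFormatInRead: Option[String] = parameters.get("timestampNTZFormat")
val timestampNTZFormatInWrite: String = parameters.getOrElse("timestampNTZFormat",
  s"${DateFormatter.defaultPattern}'T'HH:mm:ss[.SSS]")
```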
[GitHub] [spark] sadikovi commented on a change in pull request #34596: [SPARK-37326][SQL] Support TimestampNTZ in CSV data source
sadikovi commented on a change in pull request #34596: URL: https://github.com/apache/spark/pull/34596#discussion_r755538940 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala ## @@ -66,10 +68,23 @@ sealed trait TimestampFormatter extends Serializable { @throws(classOf[DateTimeParseException]) @throws(classOf[DateTimeException]) @throws(classOf[IllegalStateException]) - def parseWithoutTimeZone(s: String): Long = + def parseWithoutTimeZone(s: String, ignoreTimeZone: Boolean): Long = Review comment: Yes, I will update.
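The reviewer's suggestion isn't visible in this thread, but one plausible shape for the update visible in the diff is to keep the original one-argument signature as a forwarding overload. A hypothetical sketch only; whether `false` is the right default depends on the call sites, which aren't shown here:

```scala
// Hypothetical: preserve the old one-argument entry point by delegating
// to the new overload with a default for the added flag.
trait TimestampFormatterLike {
  def parseWithoutTimeZone(s: String, ignoreTimeZone: Boolean): Long

  // Old single-argument signature kept for source compatibility.
  def parseWithoutTimeZone(s: String): Long =
    parseWithoutTimeZone(s, ignoreTimeZone = false)
}
```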
[GitHub] [spark] SparkQA commented on pull request #34593: [SPARK-37324][SQL] Adds support for decimal rounding mode up, down, half_down
SparkQA commented on pull request #34593: URL: https://github.com/apache/spark/pull/34593#issuecomment-977218053 **[Test build #145551 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145551/testReport)** for PR 34593 at commit [`b7c2cc6`](https://github.com/apache/spark/commit/b7c2cc608f96f362147f6518e7bc596794e2948b).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34671: [SPARK-37399][SPARK-37403][PySpark][ML] Merge {ml, mllib}/common.pyi into common.py
AmplabJenkins removed a comment on pull request #34671: URL: https://github.com/apache/spark/pull/34671#issuecomment-976848362 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145550/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34693: [SPARK-37259][SQL] Support CTE queries with MSSQL JDBC
AmplabJenkins removed a comment on pull request #34693: URL: https://github.com/apache/spark/pull/34693#issuecomment-976848358 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145548/
[GitHub] [spark] viirya commented on pull request #34642: [SPARK-37369][SQL] Avoid redundant ColumnarToRow transistion on InMemoryTableScan
viirya commented on pull request #34642: URL: https://github.com/apache/spark/pull/34642#issuecomment-977016485 > I'm trying to understand the motivation. Is it because in-memory table can output rows efficiently? Parquet scan can also output rows but we try our best to output columnar batches. For a Parquet scan, when we ask it to output columnar batches, it actually behaves quite differently from the row-based approach because it runs the vectorized Parquet reader. I think this is why we try our best to produce columnar batches for Parquet or ORC scans: the vectorized reader usually performs much better, which can offset the cost of any later columnar-to-row transition. An in-memory table, by contrast, does not do a physical disk scan; the data is already serialized in memory. The motivation is that during local experiments I found the columnar-to-row transition costly there, and the columnar output looked meaningless.
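A quick way to observe the transition being discussed is to look for a `ColumnarToRow` node above `InMemoryTableScan` in the physical plan. Plan shapes vary across Spark versions, so this spark-shell snippet only illustrates how to inspect; it is not the PR's benchmark:

```scala
// Cache a table, materialize it, then inspect the plan of a query over it.
val df = spark.range(1000000).toDF("id")
df.cache().count()                 // materialize the in-memory relation
df.selectExpr("id + 1").explain()  // look for InMemoryTableScan / ColumnarToRow
```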
[GitHub] [spark] SparkQA commented on pull request #34386: [WIP] - Changes to PySpark doc homepage and User Guide
SparkQA commented on pull request #34386: URL: https://github.com/apache/spark/pull/34386#issuecomment-977462625 **[Test build #145553 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145553/testReport)** for PR 34386 at commit [`4a22e48`](https://github.com/apache/spark/commit/4a22e485119c01af67d26ebd85ff39451d99bf54). * This patch **fails PySpark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #34386: [WIP] - Changes to PySpark doc homepage and User Guide
AmplabJenkins commented on pull request #34386: URL: https://github.com/apache/spark/pull/34386#issuecomment-977463411 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145553/
[GitHub] [spark] srowen commented on a change in pull request #34689: [SPARK-37445][BUILD] Upgrade hadoop profile to hadoop-3.3 since we support hadoop-3.3 as default now
srowen commented on a change in pull request #34689: URL: https://github.com/apache/spark/pull/34689#discussion_r755671589 ## File path: hadoop-cloud/pom.xml ## @@ -201,7 +201,7 @@ enables store-specific committers. --> - hadoop-3.2 + hadoop-3.3 Review comment: I see, are we going to need a separate Hadoop 3.2 build at all going forward? If not, I think it makes sense to call it hadoop-3.
[GitHub] [spark] AmplabJenkins commented on pull request #34593: [SPARK-37324][SQL] Adds support for decimal rounding mode up, down, half_down
AmplabJenkins commented on pull request #34593: URL: https://github.com/apache/spark/pull/34593#issuecomment-977464156 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145551/
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34679: [SPARK-37437][BUILD] Remove unused hive profile
AngersZh commented on a change in pull request #34679: URL: https://github.com/apache/spark/pull/34679#discussion_r755672634 ## File path: pom.xml ## @@ -3353,11 +3353,6 @@ - Review comment: > We could possibly leave the profile in and have it still do nothing, as if the profile doesn't exist, a build with -Phive-2.3 fails, even though it's really "OK" - it's already hive 2.3 by default. I don't feel strongly about it. Do you mean `test-hive2.3`? I think this won't work now, and we don't need it either.
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34679: [SPARK-37437][BUILD] Remove unused hive profile
AngersZh commented on a change in pull request #34679: URL: https://github.com/apache/spark/pull/34679#discussion_r755674382 ## File path: pom.xml ## @@ -3353,11 +3353,6 @@ - Review comment: > No, I just mean do not remove the profile, so that specifying it does not fail the build But a build with `-Phive-2.3` won't fail either...