[GitHub] [spark] SparkQA removed a comment on pull request #30212: [SPARK-33308][SQL] Refract current grouping analytics

2020-11-28 Thread GitBox


SparkQA removed a comment on pull request #30212:
URL: https://github.com/apache/spark/pull/30212#issuecomment-735354809


   **[Test build #131924 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131924/testReport)** for PR 30212 at commit [`7e267b8`](https://github.com/apache/spark/commit/7e267b85538da120fd8265a91861b276b631febc).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #30212: [SPARK-33308][SQL] Refract current grouping analytics

2020-11-28 Thread GitBox


SparkQA commented on pull request #30212:
URL: https://github.com/apache/spark/pull/30212#issuecomment-735357230


   **[Test build #131924 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131924/testReport)** for PR 30212 at commit [`7e267b8`](https://github.com/apache/spark/commit/7e267b85538da120fd8265a91861b276b631febc).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] maropu commented on a change in pull request #30212: [SPARK-33308][SQL] Refract current grouping analytics

2020-11-28 Thread GitBox


maropu commented on a change in pull request #30212:
URL: https://github.com/apache/spark/pull/30212#discussion_r532169671



##
File path: sql/core/src/test/resources/sql-tests/inputs/group-analytics.sql
##
@@ -18,12 +18,19 @@ AS courseSales(course, year, earnings);
 
 -- ROLLUP
 SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP(course, year) ORDER BY course, year;
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP (course, year) ORDER BY course, year;
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP(course, year, (course, year)) ORDER BY course, year;
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP(course, year, (course, year), ()) ORDER BY course, year;
 
 -- CUBE
 SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE(course, year) ORDER BY course, year;
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE (course, year) ORDER BY course, year;
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE(course, year, (course, year)) ORDER BY course, year;
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE(course, year, (course, year), ()) ORDER BY course, year;

Review comment:
   Okay, thanks for the check. So, could you follow the PostgreSQL/Oracle 
behaviour? Let's revisit it if other reviewers leave comments about the 
behaviour.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #30212: [SPARK-33308][SQL] Refract current grouping analytics

2020-11-28 Thread GitBox


AngersZh commented on a change in pull request #30212:
URL: https://github.com/apache/spark/pull/30212#discussion_r532169398



##
File path: sql/core/src/test/resources/sql-tests/inputs/group-analytics.sql
##
@@ -18,12 +18,19 @@ AS courseSales(course, year, earnings);
 
 -- ROLLUP
 SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP(course, year) ORDER BY course, year;
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP (course, year) ORDER BY course, year;
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP(course, year, (course, year)) ORDER BY course, year;
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP(course, year, (course, year), ()) ORDER BY course, year;
 
 -- CUBE
 SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE(course, year) ORDER BY course, year;
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE (course, year) ORDER BY course, year;
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE(course, year, (course, year)) ORDER BY course, year;
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE(course, year, (course, year), ()) ORDER BY course, year;

Review comment:
   And in which place should we prevent this illegal case?
   1. In AstBuilder, throw a ParseException
   2. In Analyzer, throw an AnalysisException
   3. In the Cube/Rollup classes, with an assert: `assert(groupingSets.forall(_.nonEmpty))`
   
   IMO we should throw a ParseException in AstBuilder, as sketched below.
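
   A minimal sketch of option 1 (hypothetical helper and message text; the exact hook point in Spark's AstBuilder may differ):

   ```scala
   import org.antlr.v4.runtime.ParserRuleContext
   import org.apache.spark.sql.catalyst.expressions.Expression
   import org.apache.spark.sql.catalyst.parser.ParseException

   // Hypothetical parse-time check: reject CUBE/ROLLUP/GROUPING SETS that
   // contain an empty grouping set such as `()` before analysis runs.
   def validateGroupingSets(groupingSets: Seq[Seq[Expression]], ctx: ParserRuleContext): Unit = {
     if (groupingSets.exists(_.isEmpty)) {
       throw new ParseException("empty grouping sets are not allowed in CUBE/ROLLUP", ctx)
     }
   }
   ```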





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #30529: [SPARK-33588][SQL] Respect the `spark.sql.caseSensitive` config while resolving partition spec in v1 `SHOW TABLE EXTENDED`

2020-11-28 Thread GitBox


SparkQA commented on pull request #30529:
URL: https://github.com/apache/spark/pull/30529#issuecomment-735356142


   **[Test build #131925 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131925/testReport)** for PR 30529 at commit [`aeceb63`](https://github.com/apache/spark/commit/aeceb63ee1bcc2550f2ed31c9a3f97c4eaac8469).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #30212: [SPARK-33308][SQL] Refract current grouping analytics

2020-11-28 Thread GitBox


AngersZh commented on a change in pull request #30212:
URL: https://github.com/apache/spark/pull/30212#discussion_r532169014



##
File path: sql/core/src/test/resources/sql-tests/inputs/group-analytics.sql
##
@@ -18,12 +18,19 @@ AS courseSales(course, year, earnings);
 
 -- ROLLUP
 SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP(course, year) ORDER BY course, year;
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP (course, year) ORDER BY course, year;
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP(course, year, (course, year)) ORDER BY course, year;
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP(course, year, (course, year), ()) ORDER BY course, year;
 
 -- CUBE
 SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE(course, year) ORDER BY course, year;
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE (course, year) ORDER BY course, year;
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE(course, year, (course, year)) ORDER BY course, year;
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE(course, year, (course, year), ()) ORDER BY course, year;

Review comment:
   > Could you check the behaviour of other systems (other than postgresql) when having an empty set in cube/rollup?
   
   Tested in Oracle: it does not support an empty set in the grouping sets of cube/rollup either.
   
![image](https://user-images.githubusercontent.com/46485123/100535966-681a5880-3258-11eb-84da-849266ac02fe.png)
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] maropu commented on a change in pull request #30334: [SPARK-33411][SQL] Cardinality estimation of union, sort and range operator

2020-11-28 Thread GitBox


maropu commented on a change in pull request #30334:
URL: https://github.com/apache/spark/pull/30334#discussion_r532168676



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/UnionEstimation.scala
##
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap}
+import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, Statistics, Union}
+import org.apache.spark.sql.types.{ByteType, DataType, DateType, DecimalType, DoubleType, FloatType, IntegerType, LongType, ShortType, TimestampType}
+
+/**
+ * Estimate the number of output rows by doing the sum of output rows for each child of union,
+ * and estimate min and max stats for each column by finding the overall min and max of that
+ * column coming from its children.
+ */
+object UnionEstimation {
+  import EstimationUtils._
+
+  def compare(a: Any, b: Any, dataType: DataType): Boolean = {
+dataType match {
+  case dt: IntegerType => dt.ordering.asInstanceOf[Ordering[Any]].lt(a, b)
+  case dt: LongType => dt.ordering.asInstanceOf[Ordering[Any]].lt(a, b)
+  case dt: FloatType => dt.ordering.asInstanceOf[Ordering[Any]].lt(a, b)
+  case dt: DoubleType => dt.ordering.asInstanceOf[Ordering[Any]].lt(a, b)
+  case dt: ShortType => dt.ordering.asInstanceOf[Ordering[Any]].lt(a, b)
+  case dt: ByteType => dt.ordering.asInstanceOf[Ordering[Any]].lt(a, b)
+  case dt: DateType => dt.ordering.asInstanceOf[Ordering[Any]].lt(a, b)
+  case dt: TimestampType => dt.ordering.asInstanceOf[Ordering[Any]].lt(a, b)
+  case dt: DecimalType => dt.ordering.asInstanceOf[Ordering[Any]].lt(a, b)
+  case _ => false
+}
+  }
+
+  def estimate(union: Union): Option[Statistics] = {
+val sizeInBytes = union.children.map(_.stats.sizeInBytes).sum
+val outputRows: Option[BigInt] = if (rowCountsExist(union.children: _*)) {
+  Some(union.children.map(_.stats.rowCount.get).sum)
+} else {
+  None
+}
+
+val output = union.output
+val outputAttrStats = new ArrayBuffer[(Attribute, ColumnStat)]()
+
+union.children.map(_.output).transpose.zipWithIndex.foreach {
+  case (attrs, outputIndex) =>
+val validStat = attrs.zipWithIndex.forall {
+  case (attr, childIndex) =>
+val attrStats = union.children(childIndex).stats.attributeStats
+attrStats.get(attr).isDefined && attrStats(attr).hasMinMaxStats
+}
+if (validStat) {
+  val dataType = output(outputIndex).dataType
+  val minStart: Option[Any] = None

Review comment:
   How about removing the intermediate variables like this?
   ```
   val minMaxValue = attrs.zipWithIndex.foldLeft[(Option[Any], Option[Any])]((None, None)) {
   ```
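
   For illustration, one possible completion of that fold, assuming the surrounding definitions from the hunk above (`attrs`, `union`, `dataType`, `compare`); this is a sketch, not the PR's final code:

   ```scala
   // Illustrative only: fold the children's column stats into a running
   // (min, max) pair, using `compare(a, b, dataType)` == "a is less than b".
   val minMaxValue = attrs.zipWithIndex.foldLeft[(Option[Any], Option[Any])]((None, None)) {
     case ((curMin, curMax), (attr, childIndex)) =>
       val colStat = union.children(childIndex).stats.attributeStats(attr)
       val newMin = (curMin, colStat.min) match {
         case (Some(a), Some(b)) => Some(if (compare(a, b, dataType)) a else b)
         case (a, b)             => a.orElse(b)
       }
       val newMax = (curMax, colStat.max) match {
         case (Some(a), Some(b)) => Some(if (compare(a, b, dataType)) b else a)
         case (a, b)             => a.orElse(b)
       }
       (newMin, newMax)
   }
   ```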





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] maropu commented on a change in pull request #30334: [SPARK-33411][SQL] Cardinality estimation of union, sort and range operator

2020-11-28 Thread GitBox


maropu commented on a change in pull request #30334:
URL: https://github.com/apache/spark/pull/30334#discussion_r532045253



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/UnionEstimation.scala
##
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap}
+import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, Statistics, Union}
+import org.apache.spark.sql.types.{ByteType, DataType, DateType, DecimalType, DoubleType, FloatType, IntegerType, LongType, ShortType, TimestampType}

Review comment:
   nit: `import org.apache.spark.sql.types._` looks okay.

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/UnionEstimation.scala
##
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical.statsEstimation
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap}
+import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, Statistics, 
Union}
+import org.apache.spark.sql.types.{ByteType, DataType, DateType, DecimalType, 
DoubleType, FloatType, IntegerType, LongType, ShortType, TimestampType}
+
+/**
+ * Estimate the number of output rows by doing the sum of output rows for each child of union,
+ * and estimate min and max stats for each column by finding the overall min and max of that
+ * column coming from its children.
+ */
+object UnionEstimation {
+  import EstimationUtils._
+
+  def compare(a: Any, b: Any, dataType: DataType): Boolean = {
+dataType match {
+  case dt: IntegerType => dt.ordering.asInstanceOf[Ordering[Any]].lt(a, b)
+  case dt: LongType => dt.ordering.asInstanceOf[Ordering[Any]].lt(a, b)
+  case dt: FloatType => dt.ordering.asInstanceOf[Ordering[Any]].lt(a, b)
+  case dt: DoubleType => dt.ordering.asInstanceOf[Ordering[Any]].lt(a, b)
+  case dt: ShortType => dt.ordering.asInstanceOf[Ordering[Any]].lt(a, b)
+  case dt: ByteType => dt.ordering.asInstanceOf[Ordering[Any]].lt(a, b)
+  case dt: DateType => dt.ordering.asInstanceOf[Ordering[Any]].lt(a, b)
+  case dt: TimestampType => dt.ordering.asInstanceOf[Ordering[Any]].lt(a, b)
+  case dt: DecimalType => dt.ordering.asInstanceOf[Ordering[Any]].lt(a, b)
+  case _ => false
+}
+  }
+
+  def estimate(union: Union): Option[Statistics] = {
+val sizeInBytes = union.children.map(_.stats.sizeInBytes).sum
+val outputRows: Option[BigInt] = if (rowCountsExist(union.children: _*)) {
+  Some(union.children.map(_.stats.rowCount.get).sum)
+} else {
+  None
+}
+
+val output = union.output
+val outputAttrStats = new ArrayBuffer[(Attribute, ColumnStat)]()
+
+union.children.map(_.output).transpose.zipWithIndex.foreach {
+  case (attrs, outputIndex) =>
+val validStat = attrs.zipWithIndex.forall {
+  case (attr, childIndex) =>
+val attrStats = union.children(childIndex).stats.attributeStats
+attrStats.get(attr).isDefined && attrStats(attr).hasMinMaxStats
+}
+if (validStat) {
+  val dataType = output(outputIndex).dataType
+  

[GitHub] [spark] gaborgsomogyi commented on pull request #29729: [SPARK-32032][SS] Avoid infinite wait in driver because of KafkaConsumer.poll(long) API

2020-11-28 Thread GitBox


gaborgsomogyi commented on pull request #29729:
URL: https://github.com/apache/spark/pull/29729#issuecomment-735355267


   @HeartSaVioR Thanks for the new round of review; it's really a huge change and I appreciate your effort!
   I agree with your suggestions and will apply them starting from Monday.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #30212: [SPARK-33308][SQL] Refract current grouping analytics

2020-11-28 Thread GitBox


SparkQA commented on pull request #30212:
URL: https://github.com/apache/spark/pull/30212#issuecomment-735354809


   **[Test build #131924 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131924/testReport)** for PR 30212 at commit [`7e267b8`](https://github.com/apache/spark/commit/7e267b85538da120fd8265a91861b276b631febc).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29966: [SPARK-33084][CORE][SQL] Add jar support ivy path

2020-11-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29966:
URL: https://github.com/apache/spark/pull/29966#issuecomment-735354560







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29966: [SPARK-33084][CORE][SQL] Add jar support ivy path

2020-11-28 Thread GitBox


AmplabJenkins commented on pull request #29966:
URL: https://github.com/apache/spark/pull/29966#issuecomment-735354560







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #30212: [SPARK-33308][SQL] Refract current grouping analytics

2020-11-28 Thread GitBox


AngersZh commented on a change in pull request #30212:
URL: https://github.com/apache/spark/pull/30212#discussion_r532165123



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##
@@ -1529,7 +1475,25 @@ class Analyzer(override val catalogManager: CatalogManager)
 }
 
 val resolvedGroupingExprs = a.groupingExpressions
-  .map(resolveExpressionTopDown(_, planForResolve, trimAlias = true))
+  .map {
+case c @ Cube(groupingSets) =>
+  c.copy(groupingSets =
+groupingSets.map(_.map(resolveExpressionTopDown(_, a, trimAlias = true))

Review comment:
   > Hm, I see. Since these extra entries for `GroupingSet` make the analyzer more complicated, could you find another approach for removing them? IIUC these entries are needed because `GroupingSet` holds `groupingSets` as non-children exprs. For example, it seems `groupingSets` exprs and its children (`groupByExprs` exprs) look duplicated, so is it possible to replace `groupingSets` exprs with integer indices into `groupByExprs`?
   
   This complex code confused me too, and I had tried and failed to find a way to solve it, so thanks for the suggestion about using integer indices. I have changed the code to convert to/from `groupingSets` and `children` so the duplicated entries are removed, and it works. How about the current code?
   
   > `groupingSets` exprs and its children (`groupByExprs` exprs) look duplicated, so is it possible to replace `groupingSets` exprs with integer indices into `groupByExprs`?
   
   I think changing the code around `groupByExprs` and `groupingSets` would make it hard to understand, so I changed it to use the relation between `children` and `groupingSets` instead; a toy illustration of the indices idea follows below.
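
   A standalone toy illustration of that indices idea (hypothetical names, plain strings in place of Catalyst expressions; not Spark's actual Analyzer code):

   ```scala
   // Each grouping set is stored as indices into the deduplicated
   // groupByExprs, so every expression appears (and is resolved) only once.
   object GroupingSetIndices {
     def toIndices[T](groupByExprs: Seq[T], groupingSets: Seq[Seq[T]]): Seq[Seq[Int]] =
       groupingSets.map(_.map(e => groupByExprs.indexOf(e)))

     def fromIndices[T](groupByExprs: Seq[T], indices: Seq[Seq[Int]]): Seq[Seq[T]] =
       indices.map(_.map(groupByExprs))

     def main(args: Array[String]): Unit = {
       val exprs = Seq("a", "b", "c")
       val sets  = Seq(Seq("a", "b"), Seq("a"), Seq.empty[String])
       val idx   = toIndices(exprs, sets)      // Seq(Seq(0, 1), Seq(0), Seq())
       assert(fromIndices(exprs, idx) == sets) // round-trips losslessly
     }
   }
   ```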
   
   
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #30212: [SPARK-33308][SQL] Refract current grouping analytics

2020-11-28 Thread GitBox


AngersZh commented on a change in pull request #30212:
URL: https://github.com/apache/spark/pull/30212#discussion_r532164002



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
##
@@ -39,45 +41,64 @@ trait GroupingSet extends Expression with CodegenFallback {
   override def eval(input: InternalRow): Any = throw new UnsupportedOperationException
 }
 
-// scalastyle:off line.size.limit line.contains.tab
-@ExpressionDescription(
-  usage = """
-    _FUNC_([col1[, col2 ..]]) - create a multi-dimensional cube using the specified columns
-      so that we can run aggregation on them.
-  """,
-  examples = """
-    Examples:
-      > SELECT name, age, count(*) FROM VALUES (2, 'Alice'), (5, 'Bob') people(age, name) GROUP BY _FUNC_(name, age);
-        Bob    5    1
-        Alice  2    1
-        Alice  NULL 1
-        NULL   2    1
-        NULL   NULL 2
-        Bob    NULL 1
-        NULL   5    1
-  """,
-  since = "2.0.0")
-// scalastyle:on line.size.limit line.contains.tab
-case class Cube(groupByExprs: Seq[Expression]) extends GroupingSet {}
+object GroupingSet {
+  /*
+   *  GROUP BY a, b, c WITH ROLLUP

Review comment:
   > (I know this is not your mistake though) could you remove the single 
leading space?
   
   Done

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
##
@@ -39,45 +41,64 @@ trait GroupingSet extends Expression with CodegenFallback {
   override def eval(input: InternalRow): Any = throw new UnsupportedOperationException
 }
 
-// scalastyle:off line.size.limit line.contains.tab
-@ExpressionDescription(
-  usage = """
-    _FUNC_([col1[, col2 ..]]) - create a multi-dimensional cube using the specified columns
-      so that we can run aggregation on them.
-  """,
-  examples = """
-    Examples:
-      > SELECT name, age, count(*) FROM VALUES (2, 'Alice'), (5, 'Bob') people(age, name) GROUP BY _FUNC_(name, age);
-        Bob    5    1
-        Alice  2    1
-        Alice  NULL 1
-        NULL   2    1
-        NULL   NULL 2
-        Bob    NULL 1
-        NULL   5    1
-  """,
-  since = "2.0.0")
-// scalastyle:on line.size.limit line.contains.tab
-case class Cube(groupByExprs: Seq[Expression]) extends GroupingSet {}
+object GroupingSet {
+  /*
+   *  GROUP BY a, b, c WITH ROLLUP
+   *  is equivalent to
+   *  GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ( ) ).
+   *  Group Count: N + 1 (N is the number of group expressions)
+   *
+   *  We need to get all of its subsets for the rule described above, the subset is
+   *  represented as sequence of expressions.
+   */
+  def rollupExprs(exprs: Seq[Seq[Expression]]): Seq[Seq[Expression]] =
+exprs.inits.map(_.flatten).toIndexedSeq
 
-// scalastyle:off line.size.limit line.contains.tab
-@ExpressionDescription(
-  usage = """
-    _FUNC_([col1[, col2 ..]]) - create a multi-dimensional rollup using the specified columns
-      so that we can run aggregation on them.
-  """,
-  examples = """
-    Examples:
-      > SELECT name, age, count(*) FROM VALUES (2, 'Alice'), (5, 'Bob') people(age, name) GROUP BY _FUNC_(name, age);
-        Bob    5    1
-        Alice  2    1
-        Alice  NULL 1
-        NULL   NULL 2
-        Bob    NULL 1
-  """,
-  since = "2.0.0")
-// scalastyle:on line.size.limit line.contains.tab
-case class Rollup(groupByExprs: Seq[Expression]) extends GroupingSet {}
+  /*
+   *  GROUP BY a, b, c WITH CUBE

Review comment:
   Done
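
   As a quick sanity check of the ROLLUP expansion described in the quoted doc comment, a standalone sketch (plain strings stand in for Catalyst expressions):

   ```scala
   // GROUP BY a, b, c WITH ROLLUP is equivalent to
   // GROUPING SETS ((a, b, c), (a, b), (a), ()): N + 1 sets for N expressions.
   def rollupExprs[T](exprs: Seq[Seq[T]]): Seq[Seq[T]] =
     exprs.inits.map(_.flatten).toIndexedSeq

   val expanded = rollupExprs(Seq(Seq("a"), Seq("b"), Seq("c")))
   println(expanded) // Vector(List(a, b, c), List(a, b), List(a), List())
   ```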

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala
##
@@ -39,45 +41,64 @@ trait GroupingSet extends Expression with CodegenFallback {
   override def eval(input: InternalRow): Any = throw new UnsupportedOperationException
 }
 
-// scalastyle:off line.size.limit line.contains.tab
-@ExpressionDescription(
-  usage = """
-    _FUNC_([col1[, col2 ..]]) - create a multi-dimensional cube using the specified columns
-      so that we can run aggregation on them.
-  """,
-  examples = """
-    Examples:
-      > SELECT name, age, count(*) FROM VALUES (2, 'Alice'), (5, 'Bob') people(age, name) GROUP BY _FUNC_(name, age);
-        Bob    5    1
-        Alice  2    1
-        Alice  NULL 1
-        NULL   2    1
-        NULL   NULL 2
-        Bob    NULL 1
-        NULL   5    1
-  """,
-  since = "2.0.0")
-// scalastyle:on line.size.limit line.contains.tab
-case class Cube(groupByExprs: Seq[Expression]) extends GroupingSet {}
+object GroupingSet {
+  /*
+   *  GROUP BY a, b, c WITH ROLLUP
+   *  is equivalent to
+   *  GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ( ) ).
+   *  Group Count: N + 1 (N is the number of group expressions)
+   *
+   *  We need to get all of its subsets for the rule described above, the subset is
+   *  represented as sequence of expressions.
+   */
+  de

[GitHub] [spark] SparkQA removed a comment on pull request #29966: [SPARK-33084][CORE][SQL] Add jar support ivy path

2020-11-28 Thread GitBox


SparkQA removed a comment on pull request #29966:
URL: https://github.com/apache/spark/pull/29966#issuecomment-735340981


   **[Test build #131922 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131922/testReport)** for PR 29966 at commit [`e921245`](https://github.com/apache/spark/commit/e92124581e76767929883d54e2eb6d97e0502aac).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29966: [SPARK-33084][CORE][SQL] Add jar support ivy path

2020-11-28 Thread GitBox


SparkQA commented on pull request #29966:
URL: https://github.com/apache/spark/pull/29966#issuecomment-735351504


   **[Test build #131922 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131922/testReport)** for PR 29966 at commit [`e921245`](https://github.com/apache/spark/commit/e92124581e76767929883d54e2eb6d97e0502aac).
* This patch **fails SparkR unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30535: [SPARK-33584][SQL] Hive Metastore support filter by string type with date values

2020-11-28 Thread GitBox


AmplabJenkins removed a comment on pull request #30535:
URL: https://github.com/apache/spark/pull/30535#issuecomment-735350202







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #30535: [SPARK-33584][SQL] Hive Metastore support filter by string type with date values

2020-11-28 Thread GitBox


AmplabJenkins commented on pull request #30535:
URL: https://github.com/apache/spark/pull/30535#issuecomment-735350202







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #30535: [SPARK-33584][SQL] Hive Metastore support filter by string type with date values

2020-11-28 Thread GitBox


SparkQA removed a comment on pull request #30535:
URL: https://github.com/apache/spark/pull/30535#issuecomment-735341493


   **[Test build #131923 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131923/testReport)** for PR 30535 at commit [`b363d0c`](https://github.com/apache/spark/commit/b363d0c5d4271fd070c4d6a4a2090bd032cbe009).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #30535: [SPARK-33584][SQL] Hive Metastore support filter by string type with date values

2020-11-28 Thread GitBox


SparkQA commented on pull request #30535:
URL: https://github.com/apache/spark/pull/30535#issuecomment-735349100


   **[Test build #131923 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131923/testReport)** for PR 30535 at commit [`b363d0c`](https://github.com/apache/spark/commit/b363d0c5d4271fd070c4d6a4a2090bd032cbe009).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zsxwing commented on a change in pull request #30521: [SPARK-33577][SS] Add support for V1Table in stream writer table API

2020-11-28 Thread GitBox


zsxwing commented on a change in pull request #30521:
URL: https://github.com/apache/spark/pull/30521#discussion_r532158522



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala
##
@@ -304,46 +308,68 @@ final class DataStreamWriter[T] private[sql](ds: Dataset[T]) {
* @since 3.1.0
*/
   @throws[TimeoutException]
-  def saveAsTable(tableName: String): StreamingQuery = {
-this.source = SOURCE_NAME_TABLE
+  def table(tableName: String): StreamingQuery = {

Review comment:
   `saveAsTable` sounds weird to me since `DataStreamWriter` doesn't have a `save` method like `DataFrameWriter`.
   
   I prefer `DataStreamWriter.table` as it's consistent with `DataStreamReader.table`: `DataStreamReader.table` triggers `load` and `DataStreamWriter.table` triggers `start`.
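
   For illustration, the symmetry being argued for would read like this in user code (assumes a `SparkSession` named `spark` and that the `table` naming lands as proposed in this PR; not the final API):

   ```scala
   // Hypothetical usage if DataStreamWriter.table is adopted, mirroring
   // DataStreamReader.table: reading triggers load, writing triggers start.
   val df = spark.readStream.table("source_table")
   val query = df.writeStream
     .option("checkpointLocation", "/tmp/ckpt") // illustrative path
     .table("sink_table")
   query.awaitTermination()
   ```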





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29966: [SPARK-33084][CORE][SQL] Add jar support ivy path

2020-11-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29966:
URL: https://github.com/apache/spark/pull/29966#issuecomment-735346176







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29966: [SPARK-33084][CORE][SQL] Add jar support ivy path

2020-11-28 Thread GitBox


AmplabJenkins commented on pull request #29966:
URL: https://github.com/apache/spark/pull/29966#issuecomment-735346176







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zsxwing commented on a change in pull request #30492: [SPARK-33545][CORE] Support Fallback Storage during Worker decommission

2020-11-28 Thread GitBox


zsxwing commented on a change in pull request #30492:
URL: https://github.com/apache/spark/pull/30492#discussion_r532157528



##
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##
@@ -471,6 +471,15 @@ package object config {
 "cache block replication should be positive.")
   .createWithDefaultString("30s")
 
+  private[spark] val STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH =
+    ConfigBuilder("spark.storage.decommission.fallbackStorage.path")
+      .doc("The location for fallback storage during block manager decommissioning." +

Review comment:
   nit: it's better to remind the user to enable TTL in the doc since Spark 
doesn't clean up the fallback storage.
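
   For context, a minimal sketch of enabling the feature (the path is illustrative, and its bucket needs a lifecycle/TTL rule precisely because Spark does not clean the fallback storage up itself):

   ```scala
   import org.apache.spark.sql.SparkSession

   // Illustrative setup only; the s3a path is hypothetical and should be
   // covered by an object-store lifecycle (TTL) rule, per the nit above.
   val spark = SparkSession.builder()
     .appName("decommission-fallback-demo")
     .config("spark.storage.decommission.enabled", "true")
     .config("spark.storage.decommission.fallbackStorage.path", "s3a://my-bucket/spark-fallback/")
     .getOrCreate()
   ```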





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30534: [SPARK-33582][SQL] Hive Metastore support filter by not-equals

2020-11-28 Thread GitBox


AmplabJenkins removed a comment on pull request #30534:
URL: https://github.com/apache/spark/pull/30534#issuecomment-735345323







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #30534: [SPARK-33582][SQL] Hive Metastore support filter by not-equals

2020-11-28 Thread GitBox


AmplabJenkins commented on pull request #30534:
URL: https://github.com/apache/spark/pull/30534#issuecomment-735345323







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zsxwing commented on a change in pull request #30492: [SPARK-33545][CORE] Support Fallback Storage during Worker decommission

2020-11-28 Thread GitBox


zsxwing commented on a change in pull request #30492:
URL: https://github.com/apache/spark/pull/30492#discussion_r532157411



##
File path: core/src/main/scala/org/apache/spark/storage/FallbackStorage.scala
##
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.storage
+
+import java.io.DataInputStream
+import java.nio.ByteBuffer
+
+import scala.concurrent.Future
+import scala.reflect.ClassTag
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.internal.Logging
+import org.apache.spark.internal.config.STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH
+import org.apache.spark.network.buffer.{ManagedBuffer, NioManagedBuffer}
+import org.apache.spark.rpc.{RpcAddress, RpcEndpointRef, RpcTimeout}
+import org.apache.spark.shuffle.{IndexShuffleBlockResolver, ShuffleBlockInfo}
+import org.apache.spark.shuffle.IndexShuffleBlockResolver.NOOP_REDUCE_ID
+import org.apache.spark.util.Utils
+
+/**
+ * A fallback storage used by storage decommissioners.
+ */
+private[storage] class FallbackStorage(conf: SparkConf) extends Logging {
+  require(conf.get(STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH).isDefined)
+
+  private val fallbackPath = new Path(conf.get(STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH).get)
+  private val hadoopConf = SparkHadoopUtil.get.newConfiguration(conf)
+  private val fallbackFileSystem = FileSystem.get(fallbackPath.toUri, hadoopConf)
+
+  // Visible for testing
+  def copy(
+  shuffleBlockInfo: ShuffleBlockInfo,
+  bm: BlockManager): Unit = {
+val shuffleId = shuffleBlockInfo.shuffleId
+val mapId = shuffleBlockInfo.mapId
+
+bm.migratableResolver match {
+  case r: IndexShuffleBlockResolver =>
+val indexFile = r.getIndexFile(shuffleId, mapId)
+
+if (indexFile.exists()) {
+  fallbackFileSystem.copyFromLocalFile(
+new Path(indexFile.getAbsolutePath),
+new Path(fallbackPath, s"$shuffleId/${indexFile.getName}"))
+
+  val dataFile = r.getDataFile(shuffleId, mapId)
+  if (dataFile.exists()) {
+fallbackFileSystem.copyFromLocalFile(
+  new Path(dataFile.getAbsolutePath),
+  new Path(fallbackPath, s"$shuffleId/${dataFile.getName}"))

Review comment:
   Should we add the application id to the path? For example, a user may just run the same code multiple times. It would be more convenient if the user doesn't need to use a different configuration value for each run.
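
   A minimal sketch of that suggestion, assuming the resolver side stays as in the hunk above (hypothetical wiring, not the PR's final code):

   ```scala
   // Namespace the copied shuffle files by application id so that reruns
   // sharing one fallback path don't collide; SparkConf.getAppId returns
   // spark.app.id once the application has started.
   val appId = conf.getAppId
   fallbackFileSystem.copyFromLocalFile(
     new Path(indexFile.getAbsolutePath),
     new Path(fallbackPath, s"$appId/$shuffleId/${indexFile.getName}"))
   ```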





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #30534: [SPARK-33582][SQL] Hive Metastore support filter by not-equals

2020-11-28 Thread GitBox


SparkQA removed a comment on pull request #30534:
URL: https://github.com/apache/spark/pull/30534#issuecomment-735320260


   **[Test build #131920 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131920/testReport)** for PR 30534 at commit [`22e149d`](https://github.com/apache/spark/commit/22e149d43729bb6efcc15391dba2ac1c9b9f223e).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #30534: [SPARK-33582][SQL] Hive Metastore support filter by not-equals

2020-11-28 Thread GitBox


SparkQA commented on pull request #30534:
URL: https://github.com/apache/spark/pull/30534#issuecomment-735342825


   **[Test build #131920 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131920/testReport)** for PR 30534 at commit [`22e149d`](https://github.com/apache/spark/commit/22e149d43729bb6efcc15391dba2ac1c9b9f223e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
 * `case class UnixTimestamp(`



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29966: [SPARK-33084][CORE][SQL] Add jar support ivy path

2020-11-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29966:
URL: https://github.com/apache/spark/pull/29966#issuecomment-735341535







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29966: [SPARK-33084][CORE][SQL] Add jar support ivy path

2020-11-28 Thread GitBox


AmplabJenkins commented on pull request #29966:
URL: https://github.com/apache/spark/pull/29966#issuecomment-735341535







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #30535: [SPARK-33584][SQL] Hive Metastore support filter by string type with date values

2020-11-28 Thread GitBox


SparkQA commented on pull request #30535:
URL: https://github.com/apache/spark/pull/30535#issuecomment-735341493


   **[Test build #131923 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131923/testReport)** for PR 30535 at commit [`b363d0c`](https://github.com/apache/spark/commit/b363d0c5d4271fd070c4d6a4a2090bd032cbe009).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangyum commented on pull request #30535: [SPARK-33584][SQL] Hive Metastore support filter by string type with date values

2020-11-28 Thread GitBox


wangyum commented on pull request #30535:
URL: https://github.com/apache/spark/pull/30535#issuecomment-735341289


   retest this please



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30535: [SPARK-33584][SQL] Hive Metastore support filter by string type with date values

2020-11-28 Thread GitBox


AmplabJenkins removed a comment on pull request #30535:
URL: https://github.com/apache/spark/pull/30535#issuecomment-735340971







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #30535: [SPARK-33584][SQL] Hive Metastore support filter by string type with date values

2020-11-28 Thread GitBox


AmplabJenkins commented on pull request #30535:
URL: https://github.com/apache/spark/pull/30535#issuecomment-735340971







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29966: [SPARK-33084][CORE][SQL] Add jar support ivy path

2020-11-28 Thread GitBox


SparkQA commented on pull request #29966:
URL: https://github.com/apache/spark/pull/29966#issuecomment-735340981


   **[Test build #131922 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131922/testReport)** for PR 29966 at commit [`e921245`](https://github.com/apache/spark/commit/e92124581e76767929883d54e2eb6d97e0502aac).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #30535: [SPARK-33584][SQL] Hive Metastore support filter by string type with date values

2020-11-28 Thread GitBox


SparkQA removed a comment on pull request #30535:
URL: https://github.com/apache/spark/pull/30535#issuecomment-735320251


   **[Test build #131919 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131919/testReport)** for PR 30535 at commit [`b363d0c`](https://github.com/apache/spark/commit/b363d0c5d4271fd070c4d6a4a2090bd032cbe009).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30534: [SPARK-33582][SQL] Hive Metastore support filter by not-equals

2020-11-28 Thread GitBox


AmplabJenkins removed a comment on pull request #30534:
URL: https://github.com/apache/spark/pull/30534#issuecomment-735340836







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30535: [SPARK-33584][SQL] Hive Metastore support filter by string type with date values

2020-11-28 Thread GitBox


AmplabJenkins removed a comment on pull request #30535:
URL: https://github.com/apache/spark/pull/30535#issuecomment-735340837







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #30535: [SPARK-33584][SQL] Hive Metastore support filter by string type with date values

2020-11-28 Thread GitBox


SparkQA commented on pull request #30535:
URL: https://github.com/apache/spark/pull/30535#issuecomment-735340885


   **[Test build #131919 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131919/testReport)** for PR 30535 at commit [`b363d0c`](https://github.com/apache/spark/commit/b363d0c5d4271fd070c4d6a4a2090bd032cbe009).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #30534: [SPARK-33582][SQL] Hive Metastore support filter by not-equals

2020-11-28 Thread GitBox


AmplabJenkins commented on pull request #30534:
URL: https://github.com/apache/spark/pull/30534#issuecomment-735340836







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #30535: [SPARK-33584][SQL] Hive Metastore support filter by string type with date values

2020-11-28 Thread GitBox


AmplabJenkins commented on pull request #30535:
URL: https://github.com/apache/spark/pull/30535#issuecomment-735340837







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #30528: [SPARK-33587][Core]Kill the executor on nested fatal errors

2020-11-28 Thread GitBox


AmplabJenkins commented on pull request #30528:
URL: https://github.com/apache/spark/pull/30528#issuecomment-735338418







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #30528: [SPARK-33587][Core]Kill the executor on nested fatal errors

2020-11-28 Thread GitBox


SparkQA removed a comment on pull request #30528:
URL: https://github.com/apache/spark/pull/30528#issuecomment-735314423


   **[Test build #131918 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131918/testReport)** for PR 30528 at commit [`312f042`](https://github.com/apache/spark/commit/312f0422f7c6379747762c0b2eaf523e76c96a9b).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29966: [SPARK-33084][CORE][SQL] Add jar support ivy path

2020-11-28 Thread GitBox


SparkQA removed a comment on pull request #29966:
URL: https://github.com/apache/spark/pull/29966#issuecomment-735337456


   **[Test build #131921 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131921/testReport)** for PR 29966 at commit [`875d8a7`](https://github.com/apache/spark/commit/875d8a7b640b8112d280e49d827677215e3fb1b8).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29966: [SPARK-33084][CORE][SQL] Add jar support ivy path

2020-11-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29966:
URL: https://github.com/apache/spark/pull/29966#issuecomment-735337600







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #30528: [SPARK-33587][Core]Kill the executor on nested fatal errors

2020-11-28 Thread GitBox


SparkQA commented on pull request #30528:
URL: https://github.com/apache/spark/pull/30528#issuecomment-735338285


   **[Test build #131918 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131918/testReport)**
 for PR 30528 at commit 
[`312f042`](https://github.com/apache/spark/commit/312f0422f7c6379747762c0b2eaf523e76c96a9b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29966: [SPARK-33084][CORE][SQL] Add jar support ivy path

2020-11-28 Thread GitBox


AmplabJenkins commented on pull request #29966:
URL: https://github.com/apache/spark/pull/29966#issuecomment-735337600







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29966: [SPARK-33084][CORE][SQL] Add jar support ivy path

2020-11-28 Thread GitBox


SparkQA commented on pull request #29966:
URL: https://github.com/apache/spark/pull/29966#issuecomment-735337596


   **[Test build #131921 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131921/testReport)**
 for PR 29966 at commit 
[`875d8a7`](https://github.com/apache/spark/commit/875d8a7b640b8112d280e49d827677215e3fb1b8).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29966: [SPARK-33084][CORE][SQL] Add jar support ivy path

2020-11-28 Thread GitBox


SparkQA commented on pull request #29966:
URL: https://github.com/apache/spark/pull/29966#issuecomment-735337456


   **[Test build #131921 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131921/testReport)**
 for PR 29966 at commit 
[`875d8a7`](https://github.com/apache/spark/commit/875d8a7b640b8112d280e49d827677215e3fb1b8).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on pull request #28036: [SPARK-26341][CORE]Expose executor memory metrics at the stage level, in the Stages tab

2020-11-28 Thread GitBox


AngersZhuuuu commented on pull request #28036:
URL: https://github.com/apache/spark/pull/28036#issuecomment-735322071


   > cc @AngersZhuuuu I will be happy to help. My work email is [r...@linkedin.com](mailto:r...@linkedin.com). What is your email? Can you show this problem with some unit test? Or do I have to deploy the software on Hadoop in order to show the problem? Please advise.
   
   I can't show it with a UT. Just build with this PR, start a Spark application, and run some SQL as https://github.com/apache/spark/pull/28036#issuecomment-624502374 describes; you will find that the page keeps loading. Open the developer tool to see the web output, and you will see the error.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29966: [SPARK-33084][CORE][SQL] Add jar support ivy path

2020-11-28 Thread GitBox


AngersZhuuuu commented on a change in pull request #29966:
URL: https://github.com/apache/spark/pull/29966#discussion_r532135394



##
File path: core/src/main/scala/org/apache/spark/util/DependencyUtils.scala
##
@@ -25,12 +25,95 @@ import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileSystem, Path}
 
 import org.apache.spark.{SecurityManager, SparkConf, SparkException}
+import org.apache.spark.deploy.SparkSubmitUtils
 import org.apache.spark.internal.Logging
-import org.apache.spark.util.{MutableURLClassLoader, Utils}
 
-private[deploy] object DependencyUtils extends Logging {
+private[spark] object DependencyUtils extends Logging {
+
+  def getIvyProperties(): Seq[String] = {
+    Seq(
+      "spark.jars.excludes",
+      "spark.jars.packages",
+      "spark.jars.repositories",
+      "spark.jars.ivy",
+      "spark.jars.ivySettings"
+    ).map(sys.props.get(_).orNull)
+  }
+
+  private def parseURLQueryParameter(queryString: String, queryTag: String): Array[String] = {
+    if (queryString == null || queryString.isEmpty) {
+      Array.empty[String]
+    } else {
+      val mapTokens = queryString.split("&")
+      assert(mapTokens.forall(_.split("=").length == 2)
+        , "Invalid URI query string: [ " + queryString + " ]")
+      mapTokens.map(_.split("=")).map(kv => (kv(0), kv(1))).filter(_._1 == queryTag).map(_._2)
+    }
+  }
+
+  /**
+   * Parse excluded list in ivy URL. When download ivy URL jar, Spark won't download transitive jar
+   * in excluded list.
+   *
+   * @param queryString Ivy URI query part string.
+   * @return Exclude list which contains grape parameters of exclude.
+   *         Example: Input:  exclude=org.mortbay.jetty:jetty,org.eclipse.jetty:jetty-http
+   *         Output:  [org.mortbay.jetty:jetty, org.eclipse.jetty:jetty-http]
+   */
+  private def parseExcludeList(queryString: String): String = {
+    parseURLQueryParameter(queryString, "exclude")
+      .flatMap { excludeString =>
+        val excludes: Array[String] = excludeString.split(",")
+        assert(excludes.forall(_.split(":").length == 2),
+          "Invalid exclude string: expected 'org:module,org:module,..'," +
+            " found [ " + excludeString + " ]")
+        excludes
+      }.mkString(":")
+  }
+
+  /**
+   * Parse transitive parameter in ivy URL, default value is false.
+   *
+   * @param queryString Ivy URI query part string.
+   * @return Exclude list which contains grape parameters of transitive.
+   *         Example: Input:  exclude=org.mortbay.jetty:jetty&transitive=true
+   *         Output:  true
+   */
+  private def parseTransitive(queryString: String): Boolean = {
+    val transitive = parseURLQueryParameter(queryString, "transitive")
+    if (transitive.isEmpty) {
+      false
+    } else {
+      if (transitive.length > 1) {
+        logWarning("It's best to specify `transitive` parameter in ivy URL query only once." +
+          " If there are multiple `transitive` parameter, we will select the last one")
+      }
+      transitive.last.toBoolean
+    }
+  }
+
+  /**
+   * Download Ivy URIs dependent jars.

Review comment:
   Done
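
   For context on what these helpers parse, a hedged usage sketch (the `ADD JAR ivy://` form and the coordinates below are assumptions based on the PR title, not lines from this diff; `exclude` and `transitive` are the query parameters handled by `parseExcludeList` and `parseTransitive` above):
   ```scala
   // Hypothetical Ivy URI: resolve the jar and its transitive dependencies,
   // skipping the excluded org:module pair. Without a transitive parameter,
   // parseTransitive above defaults to false.
   spark.sql(
     "ADD JAR ivy://org.mortbay.jetty:jetty:6.1.26" +
       "?exclude=org.eclipse.jetty:jetty-http&transitive=true")
   ```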

[GitHub] [spark] SparkQA commented on pull request #30535: [SPARK-33584][SQL] Hive Metastore support filter by string type with date values

2020-11-28 Thread GitBox


SparkQA commented on pull request #30535:
URL: https://github.com/apache/spark/pull/30535#issuecomment-735320251


   **[Test build #131919 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131919/testReport)**
 for PR 30535 at commit 
[`b363d0c`](https://github.com/apache/spark/commit/b363d0c5d4271fd070c4d6a4a2090bd032cbe009).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #30534: [SPARK-33582][SQL] Hive Metastore support filter by not-equals

2020-11-28 Thread GitBox


SparkQA commented on pull request #30534:
URL: https://github.com/apache/spark/pull/30534#issuecomment-735320260


   **[Test build #131920 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131920/testReport)**
 for PR 30534 at commit 
[`22e149d`](https://github.com/apache/spark/commit/22e149d43729bb6efcc15391dba2ac1c9b9f223e).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangyum opened a new pull request #30535: [SPARK-33584][SQL] Hive Metastore support filter by string type with date values

2020-11-28 Thread GitBox


wangyum opened a new pull request #30535:
URL: https://github.com/apache/spark/pull/30535


   ### What changes were proposed in this pull request?
   
   This PR makes the Hive Metastore support filtering by string type with date values; a common case is:
   ```scala
   spark.sql("create table t1(id string) partitioned by (part string) stored as parquet")
   spark.sql("insert into t1 values('1', '2019-01-01')")
   spark.sql("insert into t1 values('2', '2019-01-02')")
   spark.sql("select * from t1 where part = date '2019-01-01' ").show
   ```
   
   By 
[default](https://github.com/apache/spark/blob/6d31daeb6a2c5607ffe3b23ffb381626ad57f576/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L126-L129),
 we use `DateType` to compare `StringType` and `DateType`.
   
   
   ### Why are the changes needed?
   
   Improve query performance.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit test.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30528: [SPARK-33587][Core]Kill the executor on nested fatal errors

2020-11-28 Thread GitBox


AmplabJenkins removed a comment on pull request #30528:
URL: https://github.com/apache/spark/pull/30528#issuecomment-735318912







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #30528: [SPARK-33587][Core]Kill the executor on nested fatal errors

2020-11-28 Thread GitBox


AmplabJenkins commented on pull request #30528:
URL: https://github.com/apache/spark/pull/30528#issuecomment-735318912







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangyum opened a new pull request #30534: [SPARK-33582][SQL] Hive Metastore support filter by not-equals

2020-11-28 Thread GitBox


wangyum opened a new pull request #30534:
URL: https://github.com/apache/spark/pull/30534


   ### What changes were proposed in this pull request?
   
   This PR makes partition predicate pushdown into the Hive metastore support the not-equals operator.
   
   Hive related changes:
   
https://github.com/apache/hive/blob/b8bd4594bef718b1eeac9fceb437d7df7b480ed1/itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java#L2194-L2207
   https://issues.apache.org/jira/browse/HIVE-2702
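
   As a hedged illustration of the intended effect (the table and predicate below are invented for this description, not taken from the PR):
   ```scala
   // Hypothetical example: with this change, the not-equals partition predicate
   // below can be pushed down to the Hive metastore instead of listing every
   // partition and filtering on the Spark side.
   spark.sql("create table t2(id string) partitioned by (part string) stored as parquet")
   spark.sql("insert into t2 values('1', '2019-01-01')")
   spark.sql("insert into t2 values('2', '2019-01-02')")
   spark.sql("select * from t2 where part != '2019-01-01'").show
   ```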
   
   ### Why are the changes needed?
   
   Improve query performance.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit test.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #30528: [SPARK-33587][Core]Kill the executor on nested fatal errors

2020-11-28 Thread GitBox


SparkQA removed a comment on pull request #30528:
URL: https://github.com/apache/spark/pull/30528#issuecomment-735305414


   **[Test build #131916 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131916/testReport)**
 for PR 30528 at commit 
[`1ec1c1d`](https://github.com/apache/spark/commit/1ec1c1dc4baea6235d8a1e0e6d2b72790c950f51).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #30528: [SPARK-33587][Core]Kill the executor on nested fatal errors

2020-11-28 Thread GitBox


SparkQA commented on pull request #30528:
URL: https://github.com/apache/spark/pull/30528#issuecomment-735316930


   **[Test build #131916 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131916/testReport)**
 for PR 30528 at commit 
[`1ec1c1d`](https://github.com/apache/spark/commit/1ec1c1dc4baea6235d8a1e0e6d2b72790c950f51).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangyum commented on pull request #30525: [SPARK-33581][SQL][TEST] Refactor HivePartitionFilteringSuite

2020-11-28 Thread GitBox


wangyum commented on pull request #30525:
URL: https://github.com/apache/spark/pull/30525#issuecomment-735315745


   Merged to master.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangyum closed pull request #30525: [SPARK-33581][SQL][TEST] Refactor HivePartitionFilteringSuite

2020-11-28 Thread GitBox


wangyum closed pull request #30525:
URL: https://github.com/apache/spark/pull/30525


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ron8hu commented on pull request #28036: [SPARK-26341][CORE]Expose executor memory metrics at the stage level, in the Stages tab

2020-11-28 Thread GitBox


ron8hu commented on pull request #28036:
URL: https://github.com/apache/spark/pull/28036#issuecomment-735314812


   cc @AngersZhuuuu I will be happy to help. My work email is r...@linkedin.com. What is your email? Can you show this problem with some unit test? Or do I have to deploy the software on Hadoop in order to show the problem? Please advise.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30528: [SPARK-33587][Core]Kill the executor on nested fatal errors

2020-11-28 Thread GitBox


AmplabJenkins removed a comment on pull request #30528:
URL: https://github.com/apache/spark/pull/30528#issuecomment-735314485







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #30528: [SPARK-33587][Core]Kill the executor on nested fatal errors

2020-11-28 Thread GitBox


AmplabJenkins commented on pull request #30528:
URL: https://github.com/apache/spark/pull/30528#issuecomment-735314485







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #30528: [SPARK-33587][Core]Kill the executor on nested fatal errors

2020-11-28 Thread GitBox


SparkQA commented on pull request #30528:
URL: https://github.com/apache/spark/pull/30528#issuecomment-735314423


   **[Test build #131918 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131918/testReport)**
 for PR 30528 at commit 
[`312f042`](https://github.com/apache/spark/commit/312f0422f7c6379747762c0b2eaf523e76c96a9b).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30528: [SPARK-33587][Core]Kill the executor on nested fatal errors

2020-11-28 Thread GitBox


AmplabJenkins removed a comment on pull request #30528:
URL: https://github.com/apache/spark/pull/30528#issuecomment-735314155







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #30528: [SPARK-33587][Core]Kill the executor on nested fatal errors

2020-11-28 Thread GitBox


AmplabJenkins commented on pull request #30528:
URL: https://github.com/apache/spark/pull/30528#issuecomment-735314155







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #30533: [SPARK-33380] [EXAMPLES] New python example (series-pi.py) to calculate pi via series

2020-11-28 Thread GitBox


AmplabJenkins commented on pull request #30533:
URL: https://github.com/apache/spark/pull/30533#issuecomment-735313995


   Can one of the admins verify this patch?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #30528: [SPARK-33587][Core]Kill the executor on nested fatal errors

2020-11-28 Thread GitBox


SparkQA removed a comment on pull request #30528:
URL: https://github.com/apache/spark/pull/30528#issuecomment-735305036


   **[Test build #131915 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131915/testReport)**
 for PR 30528 at commit 
[`2720b60`](https://github.com/apache/spark/commit/2720b602e1b263d7c0115eab4ecc0b23d18962fc).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #29994: [DONOTMERGE][WHITESPACE] workflow exercise

2020-11-28 Thread GitBox


AmplabJenkins removed a comment on pull request #29994:
URL: https://github.com/apache/spark/pull/29994#issuecomment-735313906







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #30528: [SPARK-33587][Core]Kill the executor on nested fatal errors

2020-11-28 Thread GitBox


SparkQA commented on pull request #30528:
URL: https://github.com/apache/spark/pull/30528#issuecomment-735313956


   **[Test build #131915 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131915/testReport)**
 for PR 30528 at commit 
[`2720b60`](https://github.com/apache/spark/commit/2720b602e1b263d7c0115eab4ecc0b23d18962fc).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #29994: [DONOTMERGE][WHITESPACE] workflow exercise

2020-11-28 Thread GitBox


AmplabJenkins commented on pull request #29994:
URL: https://github.com/apache/spark/pull/29994#issuecomment-735313906







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] jsoref commented on pull request #30323: Spelling

2020-11-28 Thread GitBox


jsoref commented on pull request #30323:
URL: https://github.com/apache/spark/pull/30323#issuecomment-735313703


   Things I know will remain:
   
   * `create` (it spans multiple directories) -- we can address this after the 
current open PRs resolve (probably using this PR)
   * `enabled` (`legacy_setops_precedence_enbled` appears to be a public API -- addressing it would be done as its own distinct PR, if at all. One approach is to add a correctly spelled field, make it the preferred one, and add a deprecated annotation to the current spelling; another approach is to just add a comment acknowledging the API botch)
   
   I think that's everything, but we'll see.
   
   Note to self: I changed one `E.g.` to `For example` (which actually fits with a second one I changed the same way earlier), so I'm going to have a merge conflict: resolver=drop.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA removed a comment on pull request #29994: [DONOTMERGE][WHITESPACE] workflow exercise

2020-11-28 Thread GitBox


SparkQA removed a comment on pull request #29994:
URL: https://github.com/apache/spark/pull/29994#issuecomment-735309349


   **[Test build #131917 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131917/testReport)**
 for PR 29994 at commit 
[`995d434`](https://github.com/apache/spark/commit/995d434531efc727d19538341a08dd9fa15f8c86).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29994: [DONOTMERGE][WHITESPACE] workflow exercise

2020-11-28 Thread GitBox


SparkQA commented on pull request #29994:
URL: https://github.com/apache/spark/pull/29994#issuecomment-735313444


   **[Test build #131917 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131917/testReport)**
 for PR 29994 at commit 
[`995d434`](https://github.com/apache/spark/commit/995d434531efc727d19538341a08dd9fa15f8c86).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] milindvdamle opened a new pull request #30533: [SPARK-33380] [EXAMPLES] New python example (series-pi.py) to calculate pi via series

2020-11-28 Thread GitBox


milindvdamle opened a new pull request #30533:
URL: https://github.com/apache/spark/pull/30533


   
   
   ### What changes were proposed in this pull request?
   
   
   
   
   ### Why are the changes needed?
   The current example script (pi.py) produces an approximate value for pi based on randomly chosen points and is neither deterministic nor accurate. A new example is needed to fix this.
   https://en.wikipedia.org/wiki/Approximations_of_%CF%80 
   http://oeis.org/wiki/Pi#Decimal_expansion_of_.CF.80.E2.80.8A2
   https://mathworld.wolfram.com/PiFormulas.html
   
   
![image](https://user-images.githubusercontent.com/9453471/100529357-97b75b00-319b-11eb-8e46-78ea845db567.png)
   
   This new example calculates pi using a partial sum of a series that implements the formula shown in the attached image.
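
   Since that formula is only available as an image, here is a minimal Scala sketch of the same deterministic, partitioned partial-sum approach under stated assumptions: the Leibniz series pi = 4 * sum((-1)^k / (2k + 1)) stands in for the PR's series, and `spark` is an existing SparkSession (the PR's actual script, series-pi.py, is written in Python):
   ```scala
   // Sketch only: deterministic for a fixed number of terms, unlike the
   // random sampling in pi.py; more terms give a closer approximation.
   val terms = 100000000L
   val partitions = 8
   val pi = 4.0 * spark.sparkContext
     .range(0L, terms, numSlices = partitions)
     .map(k => (if (k % 2 == 0) 1.0 else -1.0) / (2.0 * k + 1.0))
     .sum()
   println(s"Approximation of pi with $terms terms: $pi")
   ```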
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, it provides an additional example Python script (series-pi.py) to calculate pi.
   
   ### How was this patch tested?
   The script accepts two optional command-line parameters. The first parameter 
specifies the number of terms to use to approximate pi. The second parameter 
specifies the number of Spark partitions to use to calculate the result.
   
   The script was run multiple times with 10 million terms, 100 million terms, 
500 million terms and 1 billion terms each as the value of the first parameter. 
The resulting approximate value for pi was verified against the published/known 
approximation (https://oeis.org/A000796).
   Two things were noted and verified:
   1. For the same number of terms provided, it was verified that each run of 
the script produced a deterministic value for pi.
   2. Increasing the number of terms (first parameter) produced a more accurate 
approximation.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] jsoref commented on a change in pull request #30532: [MINOR] Spelling sql not core

2020-11-28 Thread GitBox


jsoref commented on a change in pull request #30532:
URL: https://github.com/apache/spark/pull/30532#discussion_r532126432



##
File path: 
sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
##
@@ -21,7 +21,7 @@ grammar SqlBase;
* When false, INTERSECT is given the greater precedence over the other set
* operations (UNION, EXCEPT and MINUS) as per the SQL standard.
*/
-  public boolean legacy_setops_precedence_enbled = false;
+  public boolean legacy_setops_precedence_enabled = false;

Review comment:
   GitHub says maybe: 
https://github.com/search?q=legacy_setops_precedence_enbled&type=code
   
   I'm going to drop this for now; we can revisit it separately.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30528: [SPARK-33587][Core]Kill the executor on nested fatal errors

2020-11-28 Thread GitBox


dongjoon-hyun commented on a change in pull request #30528:
URL: https://github.com/apache/spark/pull/30528#discussion_r532125888



##
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##
@@ -1946,6 +1946,16 @@ package object config {
       .booleanConf
       .createWithDefault(false)
 
+  private[spark] val EXECUTOR_KILL_ON_FATAL_ERROR_DEPTH =
+    ConfigBuilder("spark.executor.killOnFatalError.depth")
+      .doc("The max depth of the exception chain in a failed task Spark will search for a fatal " +
+        "error to check whether it should kill an executor. 0 means not checking any fatal " +
+        "error, 1 means checking only the exception but not the cause, and so on.")

Review comment:
   Thanks!
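
   To make the `depth` semantics concrete, a minimal sketch (an assumed helper, not the PR's actual code; `NonFatal` approximates the executor's fatal-error test) of a depth-limited walk of the cause chain:
   ```scala
   import scala.util.control.NonFatal

   // Sketch only: report whether a fatal error appears within `depth` links of
   // the exception chain. depth = 0 checks nothing; depth = 1 checks only `t`
   // itself and not its cause, matching the doc string above.
   @annotation.tailrec
   def hasFatalErrorWithinDepth(t: Throwable, depth: Int): Boolean = {
     if (t == null || depth <= 0) false
     else if (!NonFatal(t)) true
     else hasFatalErrorWithinDepth(t.getCause, depth - 1)
   }
   ```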





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zsxwing commented on a change in pull request #30528: [SPARK-33587][Core]Kill the executor on nested fatal errors

2020-11-28 Thread GitBox


zsxwing commented on a change in pull request #30528:
URL: https://github.com/apache/spark/pull/30528#discussion_r532125783



##
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##
@@ -1946,6 +1946,16 @@ package object config {
       .booleanConf
       .createWithDefault(false)
 
+  private[spark] val EXECUTOR_KILL_ON_FATAL_ERROR_DEPTH =
+    ConfigBuilder("spark.executor.killOnFatalError.depth")
+      .doc("The max depth of the exception chain in a failed task Spark will search for a fatal " +
+        "error to check whether it should kill an executor. 0 means not checking any fatal " +
+        "error, 1 means checking only the exception but not the cause, and so on.")

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] jsoref commented on a change in pull request #30531: [MINOR] Spelling sql/core

2020-11-28 Thread GitBox


jsoref commented on a change in pull request #30531:
URL: https://github.com/apache/spark/pull/30531#discussion_r532125721



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
##
@@ -188,13 +188,13 @@ private[parquet] class ParquetRowConverter(
*/
   def currentRecord: InternalRow = currentRow
 
-  private val dateRebaseFunc = DataSourceUtils.creteDateRebaseFuncInRead(
+  private val dateRebaseFunc = DataSourceUtils.createDateRebaseFuncInRead(

Review comment:
   Sigh, yeah, this'll definitely fail. I've hit this repeatedly while 
dropping things. Lemme drop it from here.
   
   It can merge in the root PR.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun closed pull request #30527: [MINOR][SQL] Remove `getTables()` from `r.SQLUtils`

2020-11-28 Thread GitBox


dongjoon-hyun closed pull request #30527:
URL: https://github.com/apache/spark/pull/30527


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on a change in pull request #30532: Spelling sql not core

2020-11-28 Thread GitBox


srowen commented on a change in pull request #30532:
URL: https://github.com/apache/spark/pull/30532#discussion_r532124633



##
File path: 
sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
##
@@ -21,7 +21,7 @@ grammar SqlBase;
* When false, INTERSECT is given the greater precedence over the other set
* operations (UNION, EXCEPT and MINUS) as per the SQL standard.
*/
-  public boolean legacy_setops_precedence_enbled = false;
+  public boolean legacy_setops_precedence_enabled = false;

Review comment:
   Oh hm I wonder if this has any implications elsewhere, like does this 
affect any non-private generated code?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] github-actions[bot] commented on pull request #29442: [SPARK-32627][SQL][WEBUI] Add showSessionLink parameter to SqlStatsPagedTable class in ThriftServerPage

2020-11-28 Thread GitBox


github-actions[bot] commented on pull request #29442:
URL: https://github.com/apache/spark/pull/29442#issuecomment-735311304


   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on a change in pull request #30531: Spelling sql/core

2020-11-28 Thread GitBox


srowen commented on a change in pull request #30531:
URL: https://github.com/apache/spark/pull/30531#discussion_r532124529



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
##
@@ -188,13 +188,13 @@ private[parquet] class ParquetRowConverter(
*/
   def currentRecord: InternalRow = currentRow
 
-  private val dateRebaseFunc = DataSourceUtils.creteDateRebaseFuncInRead(
+  private val dateRebaseFunc = DataSourceUtils.createDateRebaseFuncInRead(

Review comment:
   Oh which of the PRs has the method rename? I wonder if others will fail 
if they don't have that change as well.

##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/DatasetPrimitiveSuite.scala
##
@@ -170,23 +170,23 @@ class DatasetPrimitiveSuite extends QueryTest with 
SharedSparkSession {
   test("groupBy function, map") {
 val ds = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11).toDS()
 val grouped = ds.groupByKey(_ % 2)
-val agged = grouped.mapGroups { (g, iter) =>
+val aggregated = grouped.mapGroups { (g, iter) =>

Review comment:
   I don't feel strongly either way, but these don't necessarily need to be renamed; they're intentionally shorthand, it seems. Leaving them as-is might just reduce the number of lines touched.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30528: [SPARK-33587][Core]Kill the executor on nested fatal errors

2020-11-28 Thread GitBox


dongjoon-hyun commented on a change in pull request #30528:
URL: https://github.com/apache/spark/pull/30528#discussion_r532124495



##
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##
@@ -1946,6 +1946,16 @@ package object config {
       .booleanConf
       .createWithDefault(false)
 
+  private[spark] val EXECUTOR_KILL_ON_FATAL_ERROR_DEPTH =
+    ConfigBuilder("spark.executor.killOnFatalError.depth")
+      .doc("The max depth of the exception chain in a failed task Spark will search for a fatal " +
+        "error to check whether it should kill an executor. 0 means not checking any fatal " +
+        "error, 1 means checking only the exception but not the cause, and so on.")

Review comment:
   Although this is `internal`, shall we add Spark version, @zsxwing ? 
Since SPARK-33587 is filed as `Improvement`, `3.1.0`?
   ```scala
   .version("3.1.0")
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30528: [SPARK-33587][Core]Kill the executor on nested fatal errors

2020-11-28 Thread GitBox


dongjoon-hyun commented on a change in pull request #30528:
URL: https://github.com/apache/spark/pull/30528#discussion_r532124495



##
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##
@@ -1946,6 +1946,16 @@ package object config {
       .booleanConf
       .createWithDefault(false)
 
+  private[spark] val EXECUTOR_KILL_ON_FATAL_ERROR_DEPTH =
+    ConfigBuilder("spark.executor.killOnFatalError.depth")
+      .doc("The max depth of the exception chain in a failed task Spark will search for a fatal " +
+        "error to check whether it should kill an executor. 0 means not checking any fatal " +
+        "error, 1 means checking only the exception but not the cause, and so on.")

Review comment:
   Could you add Spark version, @zsxwing ? Since SPARK-33587 is filed as 
`Improvement`, `3.1.0`?
   ```scala
   .version("3.1.0")
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on a change in pull request #30530: Spelling bin core docs external mllib repl

2020-11-28 Thread GitBox


srowen commented on a change in pull request #30530:
URL: https://github.com/apache/spark/pull/30530#discussion_r532124096



##
File path: docs/sql-migration-guide.md
##
@@ -333,7 +333,7 @@ license: |
 
 
 
-  - Since Spark 2.4, when there is a struct field in front of the IN operator 
before a subquery, the inner query must contain a struct field as well. In 
previous versions, instead, the fields of the struct were compared to the 
output of the inner query. Eg. if `a` is a `struct(a string, b int)`, in Spark 
2.4 `a in (select (1 as a, 'a' as b) from range(1))` is a valid query, while `a 
in (select 1, 'a' from range(1))` is not. In previous version it was the 
opposite.
+  - Since Spark 2.4, when there is a struct field in front of the IN operator 
before a subquery, the inner query must contain a struct field as well. In 
previous versions, instead, the fields of the struct were compared to the 
output of the inner query. E.g. if `a` is a `struct(a string, b int)`, in Spark 
2.4 `a in (select (1 as a, 'a' as b) from range(1))` is a valid query, while `a 
in (select 1, 'a' from range(1))` is not. In previous version it was the 
opposite.

Review comment:
   Nit, in case we need to fix anything else: "E.g." probably should not begin a sentence; "For example," is better.

##
File path: docs/sql-migration-guide.md
##
@@ -532,11 +532,11 @@ license: |
 
  - Since Spark 2.3, by default arithmetic operations between decimals return a 
rounded value if an exact representation is not possible (instead of returning 
NULL). This is compliant with SQL ANSI 2011 specification and Hive's new 
behavior introduced in Hive 2.2 (HIVE-15331). This involves the following 
changes
 
-- The rules to determine the result type of an arithmetic operation have 
been updated. In particular, if the precision / scale needed are out of the 
range of available values, the scale is reduced up to 6, in order to prevent 
the truncation of the integer part of the decimals. All the arithmetic 
operations are affected by the change, ie. addition (`+`), subtraction (`-`), 
multiplication (`*`), division (`/`), remainder (`%`) and positive module 
(`pmod`).
+- The rules to determine the result type of an arithmetic operation have 
been updated. In particular, if the precision / scale needed are out of the 
range of available values, the scale is reduced up to 6, in order to prevent 
the truncation of the integer part of the decimals. All the arithmetic 
operations are affected by the change, i.e. addition (`+`), subtraction (`-`), 
multiplication (`*`), division (`/`), remainder (`%`) and positive module 
(`pmod`).

Review comment:
   Also let's fix "positive module" to "positive modulus"





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #30510: [SPARK-33565][INFRA][FOLLOW-UP] Keep the test coverage with Python 3.8 in GitHub Actions

2020-11-28 Thread GitBox


dongjoon-hyun commented on pull request #30510:
URL: https://github.com/apache/spark/pull/30510#issuecomment-735309363


   +1, late LGTM.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zsxwing commented on a change in pull request #28363: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files

2020-11-28 Thread GitBox


zsxwing commented on a change in pull request #28363:
URL: https://github.com/apache/spark/pull/28363#discussion_r532122900



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##
@@ -136,8 +136,9 @@ class FileStreamSink(
   private val basePath = new Path(path)
   private val logPath = getMetadataLogPath(basePath.getFileSystem(hadoopConf), basePath,
     sparkSession.sessionState.conf)
-  private val fileLog =
-    new FileStreamSinkLog(FileStreamSinkLog.VERSION, sparkSession, logPath.toString)
+  private val outputTimeToLive = options.get("outputRetentionMs").map(_.toLong)

Review comment:
   Nit: it would be great to output an info log for this value if it's set. 
It might be useful when debugging data issues caused by the retention.
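
   A hedged sketch of such a log line (the field name follows the diff above; the message wording is invented):
   ```scala
   // Sketch only: surface the configured retention once at construction time so
   // it is visible when debugging data issues caused by dropped entries.
   outputTimeToLive.foreach { ttlMs =>
     logInfo(s"Output file retention is enabled; sink log entries older than $ttlMs ms expire")
   }
   ```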





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #30511: [SPARK-33565][INFRA][FOLLOW-UP][3.0] Keep the test coverage with Python 3.8 in GitHub Actions

2020-11-28 Thread GitBox


dongjoon-hyun commented on pull request #30511:
URL: https://github.com/apache/spark/pull/30511#issuecomment-735309344


   +1, late LGTM.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] SparkQA commented on pull request #29994: [DONOTMERGE][WHITESPACE] workflow exercise

2020-11-28 Thread GitBox


SparkQA commented on pull request #29994:
URL: https://github.com/apache/spark/pull/29994#issuecomment-735309349


   **[Test build #131917 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131917/testReport)**
 for PR 29994 at commit 
[`995d434`](https://github.com/apache/spark/commit/995d434531efc727d19538341a08dd9fa15f8c86).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30529: [SPARK-33588][SQL] Respect the `spark.sql.caseSensitive` config while resolving partition spec in v1 `SHOW TABLE EXTENDED`

2020-11-28 Thread GitBox


AmplabJenkins removed a comment on pull request #30529:
URL: https://github.com/apache/spark/pull/30529#issuecomment-735308969







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30528: [SPARK-33587][Core]Kill the executor on nested fatal errors

2020-11-28 Thread GitBox


AmplabJenkins removed a comment on pull request #30528:
URL: https://github.com/apache/spark/pull/30528#issuecomment-735308970







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins removed a comment on pull request #30523: [SPARK-33579][CORE] Fix executor blank pages behind proxy

2020-11-28 Thread GitBox


AmplabJenkins removed a comment on pull request #30523:
URL: https://github.com/apache/spark/pull/30523#issuecomment-735308971







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30492: [SPARK-33545][CORE] Support Fallback Storage during Worker decommission

2020-11-28 Thread GitBox


dongjoon-hyun commented on a change in pull request #30492:
URL: https://github.com/apache/spark/pull/30492#discussion_r532122517



##
File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala
##
@@ -627,7 +627,16 @@ private[spark] class BlockManager(
   override def getLocalBlockData(blockId: BlockId): ManagedBuffer = {
     if (blockId.isShuffle) {
       logDebug(s"Getting local shuffle block ${blockId}")
-      shuffleManager.shuffleBlockResolver.getBlockData(blockId)
+      try {
+        shuffleManager.shuffleBlockResolver.getBlockData(blockId)
+      } catch {
+        case e: IOException =>
+          if (conf.get(config.STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH).isDefined) {
+            FallbackStorage.read(conf, blockId)
+          } else {
+            throw e
+          }

Review comment:
   It's already `IOException`, and `FallbackStorage.read` throws `IOException` for non-existent files, which is fine and legitimate in the `WorkerDecommission` context, @viirya. Please note the following.
   - The whole worker decommission (including `shuffle and rdd storage decommission`) is designed as a best-effort approach.
   - That's because the main use case is K8s graceful shutdown with the default period (`30s`). We can increase the period, but we cannot technically set it to an infinite value, which means the executor would hang in a disk-full situation.
   - What we are aiming for is to rescue as much data as possible, but `100%` is not always guaranteed.
   - Lastly, data block selection (shuffle or rdd) has been in random order from the beginning. It becomes worse when there are multiple shuffles across multiple executors.
   
   Due to the above reasons, we introduced `SPARK-33387 Support ordered shuffle block migration` to rescue data as completely as possible. However, we can still easily imagine multiple shuffles coexisting on an executor in a skewed manner. While the other executors succeed in completing the migration, the skewed executor may fail to complete the migration within the given grace period.
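
   For reference, a hedged sketch of the configuration this code path guards on (the key corresponds to `STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH`; the bucket path is invented):
   ```scala
   import org.apache.spark.SparkConf

   // Hypothetical setup: when the fallback path is set, the catch block in the
   // diff above reads migrated shuffle blocks from this location on IOException
   // instead of failing the fetch.
   val conf = new SparkConf()
     .set("spark.storage.decommission.fallbackStorage.path", "s3a://my-bucket/spark-fallback/")
   ```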





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #30531: Spelling sql/core

2020-11-28 Thread GitBox


AmplabJenkins commented on pull request #30531:
URL: https://github.com/apache/spark/pull/30531#issuecomment-735309023


   Can one of the admins verify this patch?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AmplabJenkins commented on pull request #30532: Spelling sql not core

2020-11-28 Thread GitBox


AmplabJenkins commented on pull request #30532:
URL: https://github.com/apache/spark/pull/30532#issuecomment-735309019


   Can one of the admins verify this patch?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zsxwing commented on a change in pull request #28363: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files

2020-11-28 Thread GitBox


zsxwing commented on a change in pull request #28363:
URL: https://github.com/apache/spark/pull/28363#discussion_r532122181



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLog.scala
##
@@ -96,6 +97,19 @@ class FileStreamSinkLog(
   require(defaultCompactInterval > 0,
     s"Please set ${SQLConf.FILE_SINK_LOG_COMPACT_INTERVAL.key} (was $defaultCompactInterval) " +
       "to a positive value.")
+
+  private val ttlMs = outputTimeToLiveMs.getOrElse(Long.MaxValue)
+
+  override def shouldRetain(log: SinkFileStatus): Boolean = {
+    val curTime = System.currentTimeMillis()

Review comment:
   It would be great to avoid calling `System.currentTimeMillis()` when the 
option is not set, since we would otherwise make this call (a JNI call) once 
for each log entry.
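
A minimal sketch of one way to follow that suggestion, assuming a stubbed 
`SinkFileStatus` with a `modificationTime` field (the real class differs):

object RetentionCheckSketch {
  // Stubbed stand-in for the real SinkFileStatus.
  final case class SinkFileStatus(path: String, modificationTime: Long)

  class SinkLog(outputTimeToLiveMs: Option[Long]) {
    def shouldRetain(status: SinkFileStatus): Boolean = outputTimeToLiveMs match {
      case Some(ttl) =>
        // Pay for the clock read (a JNI call) only when a TTL is configured.
        System.currentTimeMillis() - status.modificationTime < ttl
      case None =>
        true // No TTL: retain every entry without touching the clock.
    }
  }
}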

##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##
@@ -136,8 +136,9 @@ class FileStreamSink(
   private val basePath = new Path(path)
   private val logPath = getMetadataLogPath(basePath.getFileSystem(hadoopConf), basePath,
     sparkSession.sessionState.conf)
-  private val fileLog =
-    new FileStreamSinkLog(FileStreamSinkLog.VERSION, sparkSession, logPath.toString)
+  private val outputTimeToLive = options.get("outputRetentionMs").map(_.toLong)

Review comment:
   Can we use `Utils.timeStringAsMs` to parse this? Users will likely set it 
to multiple days, and asking them to calculate milliseconds is not 
user-friendly.
   
   Nit: regarding the option name, can we call it `retention`? It's obvious 
that the query is outputting files, so `output` sounds redundant to me.
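
A sketch of the suggested parsing, assuming it runs inside Spark's own 
codebase (where the `private[spark]` `org.apache.spark.util.Utils` is 
visible) and using the proposed `retention` option name:

import org.apache.spark.util.Utils

object RetentionParsingSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in for the sink's options map.
    val options = Map("retention" -> "7d")

    // "7d" parses to 604800000 ms; a bare number is treated as milliseconds.
    val retentionMs: Option[Long] =
      options.get("retention").map(Utils.timeStringAsMs)
    println(retentionMs) // Some(604800000)
  }
}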

##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala
##
@@ -136,8 +136,9 @@ class FileStreamSink(
   private val basePath = new Path(path)
   private val logPath = getMetadataLogPath(basePath.getFileSystem(hadoopConf), basePath,
     sparkSession.sessionState.conf)
-  private val fileLog =
-    new FileStreamSinkLog(FileStreamSinkLog.VERSION, sparkSession, logPath.toString)
+  private val outputTimeToLive = options.get("outputRetentionMs").map(_.toLong)

Review comment:
   Nit: any reason to use a different name, `outputTimeToLive`? Using the same 
name as the option would help other people read the code.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30492: [SPARK-33545][CORE] Support Fallback Storage during Worker decommission

2020-11-28 Thread GitBox


dongjoon-hyun commented on a change in pull request #30492:
URL: https://github.com/apache/spark/pull/30492#discussion_r532122517



##
File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala
##
@@ -627,7 +627,16 @@ private[spark] class BlockManager(
   override def getLocalBlockData(blockId: BlockId): ManagedBuffer = {
     if (blockId.isShuffle) {
       logDebug(s"Getting local shuffle block ${blockId}")
-      shuffleManager.shuffleBlockResolver.getBlockData(blockId)
+      try {
+        shuffleManager.shuffleBlockResolver.getBlockData(blockId)
+      } catch {
+        case e: IOException =>
+          if (conf.get(config.STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH).isDefined) {
+            FallbackStorage.read(conf, blockId)
+          } else {
+            throw e
+          }

Review comment:
   It's already an `IOException`, and `FallbackStorage.read` throws an 
`IOException` for non-existent files, which is fine and legitimate in the 
`WorkerDecommission` context, @viirya. Please note the following.
   - The whole Worker decommission (including `shuffle and rdd storage 
decommission`) is designed as a best-effort approach.
   - That is because the main use case is K8s graceful shutdown with the 
default grace period (`30s`). We can increase it, but we cannot technically 
set it to an infinite value; that would mean the executor hangs whenever the 
disk is full.
   - What we are aiming for is to rescue as much data as possible, but `100%` 
is never guaranteed.
   - Lastly, data block selection (shuffle or RDD) has been random-ordered 
from the beginning, and it gets worse when multiple shuffles are spread 
across multiple executors.
   
   Due to the above reasons, we introduced `SPARK-33387 Support ordered shuffle 
block migration` to rescue data as completely as possible. However, we can 
still easily imagine multiple shuffles coexisting on an executor in a skewed 
manner: while the other executors succeed in completing the migration, the 
skewed executor may fail to complete it within the given grace period.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


