[GitHub] spark issue #14112: [SPARK-16240][ML] Model loading backward compatibility f...

2016-08-28 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/14112
  
@GayathriMurali Apologies for the long delay!  It slipped past the release, 
and I'm trying to catch up on PRs now.  I'll make a final review pass ASAP





[GitHub] spark issue #14112: [SPARK-16240][ML] Model loading backward compatibility f...

2016-08-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14112
  
**[Test build #3234 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3234/consoleFull)** for PR 14112 at commit [`b16b368`](https://github.com/apache/spark/commit/b16b3682a50486a81913ce5b5b67711d73b7589f).





[GitHub] spark pull request #14831: [SPARK-17255] [Core] Spark queries inside Futures...

2016-08-28 Thread jukkavisma
Github user jukkavisma closed the pull request at:

https://github.com/apache/spark/pull/14831





[GitHub] spark issue #14831: [SPARK-17255] [Core] Spark queries inside Futures occasi...

2016-08-28 Thread jukkavisma
Github user jukkavisma commented on the issue:

https://github.com/apache/spark/pull/14831
  
JoshRosen, you are right. That solved my problem.





[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...

2016-08-28 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/14452
  
retest this please.





[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...

2016-08-28 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/14803
  
ping @zsxwing @tdas Can you help review this? Thanks.





[GitHub] spark issue #14531: [SPARK-16943] [SPARK-16942] [SQL] Fix multiple bugs in C...

2016-08-28 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/14531
  
In Hive, `CREATE TABLE LIKE` only copies from the source table the table properties that are specified in the Hive configuration `hive.ddl.createtablelike.properties.whitelist`. The default value of this configuration is empty.

See the Hive JIRA: https://issues.apache.org/jira/browse/HIVE-3337 
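
A minimal, standalone Scala sketch of those semantics (the names are illustrative, not Hive source): only properties whose keys appear in the whitelist survive the copy, so the empty default copies nothing.

```scala
// Illustrative sketch, not Hive code: CREATE TABLE LIKE keeps only the
// table properties whose keys appear in the configured whitelist.
object CreateTableLikeProps {
  def copyLikeProperties(
      source: Map[String, String],
      whitelist: Set[String]): Map[String, String] =
    source.filter { case (key, _) => whitelist.contains(key) }

  def main(args: Array[String]): Unit = {
    val props = Map("prop1" -> "v1", "prop2" -> "v2")
    println(copyLikeProperties(props, Set.empty))    // Map() -- default: empty whitelist
    println(copyLikeProperties(props, Set("prop1"))) // Map(prop1 -> v1)
  }
}
```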





[GitHub] spark pull request #14855: [SPARK-17284] [SQL] Remove Statistics-related Tab...

2016-08-28 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14855#discussion_r76554157
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -794,8 +794,10 @@ case class ShowCreateTableCommand(table: TableIdentifier) extends RunnableComman
   private def showHiveTableProperties(metadata: CatalogTable, builder: StringBuilder): Unit = {
     if (metadata.properties.nonEmpty) {
       val filteredProps = metadata.properties.filterNot {
-        // Skips "EXTERNAL" property for external tables
-        case (key, _) => key == "EXTERNAL" && metadata.tableType == EXTERNAL
+        // Skips all the stats info (See the JIRA: HIVE-13792)
+        case (key, _) =>
+          key == "numFiles" || key == "numRows" || key == "totalSize" || key == "numPartitions" ||
--- End diff --

Sure, let me change it. Thanks!





[GitHub] spark issue #14388: [SPARK-16362][SQL] Support ArrayType and StructType in v...

2016-08-28 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/14388
  
ping @maver1ck 





[GitHub] spark issue #13775: [SPARK-16060][SQL] Vectorized Orc reader

2016-08-28 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/13775
  
ping @yhuai @liancheng @hvanhovell @cloud-fan Can you take a look at this? 
Thanks.






[GitHub] spark issue #14780: [SPARK-17206][SQL] Support ANALYZE TABLE on analyzable t...

2016-08-28 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/14780
  
@hvanhovell @cloud-fan Can you help review this?





[GitHub] spark pull request #14784: [SPARK-17210][SPARKR] sparkr.zip is not distribut...

2016-08-28 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14784#discussion_r76553829
  
--- Diff: R/pkg/R/sparkR.R ---
@@ -365,6 +365,10 @@ sparkR.session <- function(
     }
     overrideEnvs(sparkConfigMap, paramMap)
   }
+  if (nzchar(master)) {
+    assign("spark.master", master, envir = sparkConfigMap)
--- End diff --

@felixcheung sorry, I don't get what you mean. L357 will override whatever `master` is; that's why I make the change after it, so that we can pass the correct master to `sparkConfigMap`.





[GitHub] spark issue #14835: [SPARK-17243] [Web UI] Spark 2.0 History Server won't lo...

2016-08-28 Thread ajbozarth
Github user ajbozarth commented on the issue:

https://github.com/apache/spark/pull/14835
  
@QQshu1 I'm not sure what problem you're having, but this PR solves a specific problem: the call to fetch the application-list JSON when the History Server loads takes too long when the application list is very long.





[GitHub] spark issue #14712: [SPARK-17072] [SQL] support table-level statistics gener...

2016-08-28 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/14712
  
Yeah, I think we should focus on the top-priority target in this PR. The Hive-related translation can be addressed in later PRs.





[GitHub] spark pull request #14856: [SPARK-17241][SparkR][MLlib] SparkR spark.glm sho...

2016-08-28 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14856#discussion_r76553172
  
--- Diff: R/pkg/R/mllib.R ---
@@ -171,7 +172,8 @@ predict_internal <- function(object, newData) {
 #' @note spark.glm since 2.0.0
 #' @seealso \link{glm}, \link{read.ml}
 setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
-  function(data, formula, family = gaussian, tol = 1e-6, maxIter = 25, weightCol = NULL) {
+  function(data, formula, family = gaussian, tol = 1e-6, regParam = 0.0, maxIter = 25,
+           weightCol = NULL) {
--- End diff --

If, say, an R user calls the function positionally as `spark.glm(df, label ~ feature, gaussian, 1e-6, 25)`, this change will break their code.





[GitHub] spark pull request #14855: [SPARK-17284] [SQL] Remove Statistics-related Tab...

2016-08-28 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/14855#discussion_r76552894
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -794,8 +794,10 @@ case class ShowCreateTableCommand(table: TableIdentifier) extends RunnableComman
   private def showHiveTableProperties(metadata: CatalogTable, builder: StringBuilder): Unit = {
     if (metadata.properties.nonEmpty) {
       val filteredProps = metadata.properties.filterNot {
-        // Skips "EXTERNAL" property for external tables
-        case (key, _) => key == "EXTERNAL" && metadata.tableType == EXTERNAL
+        // Skips all the stats info (See the JIRA: HIVE-13792)
+        case (key, _) =>
+          key == "numFiles" || key == "numRows" || key == "totalSize" || key == "numPartitions" ||
--- End diff --

Is it possible that the list of hidden properties will grow in the future? If so, can we avoid chaining them with `||` here? A separate list like the `excludedTableProperties` below seems good, and we can check whether the key is in the list.
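
A minimal sketch of the suggested refactoring (standalone and illustrative, not the final Spark code): keeping the hidden keys in one collection lets the list grow without editing the predicate.

```scala
// Illustrative sketch: filter the table properties against a single
// excluded-keys collection instead of a chain of `||` comparisons.
object FilterStatsProps {
  private val statsProperties: Set[String] =
    Set("numFiles", "numRows", "totalSize", "numPartitions")

  def filterProps(properties: Map[String, String]): Map[String, String] =
    properties.filterNot { case (key, _) => statsProperties.contains(key) }

  def main(args: Array[String]): Unit = {
    println(filterProps(Map("numRows" -> "-1", "prop1" -> "value1")))
    // Map(prop1 -> value1)
  }
}
```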





[GitHub] spark pull request #14856: [SPARK-17241][SparkR][MLlib] SparkR spark.glm sho...

2016-08-28 Thread keypointt
Github user keypointt commented on a diff in the pull request:

https://github.com/apache/spark/pull/14856#discussion_r76552881
  
--- Diff: R/pkg/R/mllib.R ---
@@ -171,7 +172,8 @@ predict_internal <- function(object, newData) {
 #' @note spark.glm since 2.0.0
 #' @seealso \link{glm}, \link{read.ml}
 setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
-  function(data, formula, family = gaussian, tol = 1e-6, maxIter = 25, weightCol = NULL) {
+  function(data, formula, family = gaussian, tol = 1e-6, regParam = 0.0, maxIter = 25,
+           weightCol = NULL) {
--- End diff --

Check the `fit()` method of the wrapper; as long as the parameter order matches, it's OK.

I've already tested it in an R terminal.





[GitHub] spark pull request #14856: [SPARK-17241][SparkR][MLlib] SparkR spark.glm sho...

2016-08-28 Thread junyangq
Github user junyangq commented on a diff in the pull request:

https://github.com/apache/spark/pull/14856#discussion_r76552745
  
--- Diff: R/pkg/R/mllib.R ---
@@ -171,7 +172,8 @@ predict_internal <- function(object, newData) {
 #' @note spark.glm since 2.0.0
 #' @seealso \link{glm}, \link{read.ml}
 setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
-  function(data, formula, family = gaussian, tol = 1e-6, maxIter = 25, weightCol = NULL) {
+  function(data, formula, family = gaussian, tol = 1e-6, regParam = 0.0, maxIter = 25,
+           weightCol = NULL) {
--- End diff --

Perhaps we can add that to the end of the argument list so that it doesn't 
break the existing calls to the function?
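
For readers following the thread, a hedged Scala illustration of the breakage (the SparkR semantics are analogous; all names here are hypothetical):

```scala
// Inserting a defaulted parameter in the middle of a signature silently
// re-binds positional arguments at existing call sites.
object PositionalBreakage {
  // old signature
  def glmOld(tol: Double = 1e-6, maxIter: Int = 25): (Double, Int) =
    (tol, maxIter)

  // new signature with regParam inserted before maxIter
  def glmNew(tol: Double = 1e-6, regParam: Double = 0.0, maxIter: Int = 25): (Double, Double, Int) =
    (tol, regParam, maxIter)

  def main(args: Array[String]): Unit = {
    println(glmOld(1e-6, 25)) // (1.0E-6,25) -- as intended
    println(glmNew(1e-6, 25)) // (1.0E-6,25.0,25) -- 25 now binds to regParam
  }
}
```

Appending `regParam` at the end, as suggested above, keeps both positional calls meaning what they did before.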





[GitHub] spark pull request #14857: [SPARK-17261][PYSPARK] Using HiveContext after re...

2016-08-28 Thread zjffdu
GitHub user zjffdu opened a pull request:

https://github.com/apache/spark/pull/14857

[SPARK-17261][PYSPARK] Using HiveContext after re-creating SparkContext in 
Spark 2.0 throws "Java.lang.illegalStateException: Cannot call methods on a 
stopped sparkContext"

## What changes were proposed in this pull request?

Set `SparkSession._instantiatedContext` to `None` so that we can recreate the SparkSession.


## How was this patch tested?

Tested manually using the following commands in the pyspark shell:
```
spark.stop()
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("show databases").show()
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zjffdu/spark SPARK-17261

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14857.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14857


commit 986a24fab27e258f263590a2e55cb88c0f8a662a
Author: Jeff Zhang 
Date:   2016-08-29T03:36:51Z

[SPARK-17261][PYSPARK] Using HiveContext after re-creating SparkContext in 
Spark 2.0 throws "Java.lang.illegalStateException: Cannot call methods on a 
stopped sparkContext"







[GitHub] spark pull request #14298: [SPARK-16283][SQL] Implement `percentile_approx` ...

2016-08-28 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14298#discussion_r76548403
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileApprox.scala ---
@@ -0,0 +1,462 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate.QuantileSummaries.Stats
+import org.apache.spark.sql.catalyst.util._
+import org.apache.spark.sql.types._
+
+/**
+ * Computes an approximate percentile (quantile) using the G-K algorithm (see below), for very
+ * large numbers of rows where the regular percentile() UDAF might run out of memory.
+ *
+ * The input is a single double value or an array of double values representing the percentiles
+ * requested. The output, corresponding to the input, is either a single double value or an
+ * array of doubles that are the percentile values.
+ */
+@ExpressionDescription(
+  usage = """_FUNC_(col, p [, B]) - Returns an approximate pth percentile of a numeric column in the
+     group. The B parameter, which defaults to 1000, controls approximation accuracy at the cost of
+     memory; higher values yield better approximations.
+    _FUNC_(col, array(p1 [, p2]...) [, B]) - Same as above, but accepts and returns an array of
+     percentile values instead of a single one.
+    """)
+case class PercentileApprox(
+    child: Expression,
+    percentilesExpr: Expression,
+    bExpr: Option[Expression],
+    percentiles: Seq[Double],  // the extracted percentiles
+    B: Int,                    // the extracted B
+    resultAsArray: Boolean,    // whether to return the result as an array
+    mutableAggBufferOffset: Int = 0,
+    inputAggBufferOffset: Int = 0) extends ImperativeAggregate {
+
+  private def this(child: Expression, percentilesExpr: Expression, bExpr: Option[Expression]) = {
+    this(
+      child = child,
+      percentilesExpr = percentilesExpr,
+      bExpr = bExpr,
+      // validate and extract percentiles
+      percentiles = PercentileApprox.validatePercentilesLiteral(percentilesExpr)._1,
+      // validate and extract B
+      B = bExpr.map(PercentileApprox.validateBLiteral(_)).getOrElse(PercentileApprox.B_DEFAULT),
+      // validate and mark whether we should return results as array of double or not
+      resultAsArray = PercentileApprox.validatePercentilesLiteral(percentilesExpr)._2)
+  }
+
+  // Constructor for the "_FUNC_(col, p) / _FUNC_(col, array(p1, ...))" form
+  def this(child: Expression, percentilesExpr: Expression) = {
+    this(child, percentilesExpr, None)
+  }
+
+  // Constructor for the "_FUNC_(col, p, B) / _FUNC_(col, array(p1, ...), B)" form
+  def this(child: Expression, percentilesExpr: Expression, bExpr: Expression) = {
+    this(child, percentilesExpr, Some(bExpr))
+  }
+
+  override def prettyName: String = "percentile_approx"
+
+  override def withNewMutableAggBufferOffset(newMutableAggBufferOffset: Int): ImperativeAggregate =
+    copy(mutableAggBufferOffset = newMutableAggBufferOffset)
+
+  override def withNewInputAggBufferOffset(newInputAggBufferOffset: Int): ImperativeAggregate =
+    copy(inputAggBufferOffset = newInputAggBufferOffset)
+
+  override def children: Seq[Expression] =
+    bExpr.map(child :: percentilesExpr :: _ :: Nil).getOrElse(child :: percentilesExpr :: Nil)
+
+  // we would return null for empty inputs
+  override def nullable: Boolean = true
+
+  override def dataType: DataType = if (resultAsArray) ArrayType(Double

[GitHub] spark issue #14807: [SPARK-17256][Deploy, Windows]Check before adding double...

2016-08-28 Thread qualiu
Github user qualiu commented on the issue:

https://github.com/apache/spark/pull/14807
  
@tritab: Thanks for your reply! Yes, I had tried that last week but forgot to put the details in the context.
(1) The full path of `spark-submit.cmd` contains a space.
(2) The quoted argument gets cut off regardless of whether it contains a space, even when escaped and wrapped in double quotes.

![image](https://cloud.githubusercontent.com/assets/18525294/18038889/d63e4738-6dcf-11e6-9c19-3799d5696020.png)






[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...

2016-08-28 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/14712#discussion_r76545213
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
@@ -207,9 +207,11 @@ class FindDataSourceTable(sparkSession: SparkSession) extends Rule[LogicalPlan]
         className = table.provider.get,
         options = table.storage.properties)
 
-    LogicalRelation(
+    val logicalRel = LogicalRelation(
       dataSource.resolveRelation(),
       metastoreTableIdentifier = Some(table.identifier))
+    logicalRel.inheritedStats = table.catalogStats
--- End diff --

Do you mean adding statistics as a constructor parameter of `LogicalRelation`? I tried this in my first implementation, but many code paths needed to be fixed (pattern matches like `LogicalRelation(src: BaseRelation, _, _)`). I'll switch back if that's OK.
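
For readers following along, a minimal, self-contained illustration of that churn (the class and field names are stand-ins, not the actual Catalyst definitions):

```scala
// Adding a constructor parameter to a case class forces every existing
// extractor pattern to grow an extra slot, which is the churn described above.
case class Relation(source: String, id: Option[String], stats: Option[Long] = None)

object PatternChurnDemo {
  def describe(plan: Relation): String = plan match {
    // Before the change this pattern was `Relation(src, _)`; after adding
    // `stats` it must become a three-slot pattern at every match site.
    case Relation(src, _, _) => s"relation over $src"
  }

  def main(args: Array[String]): Unit =
    println(describe(Relation("parquet", None))) // relation over parquet
}
```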





[GitHub] spark issue #14807: [SPARK-17256][Deploy, Windows]Check before adding double...

2016-08-28 Thread tritab
Github user tritab commented on the issue:

https://github.com/apache/spark/pull/14807
  
Does it work if you escape the internal JDBC quotes with a caret (`^`)?







[GitHub] spark issue #14807: [SPARK-17256][Deploy, Windows]Check before adding double...

2016-08-28 Thread qualiu
Github user qualiu commented on the issue:

https://github.com/apache/spark/pull/14807
  
@tsudukim: I did the validation mentioned above and captured the screenshot below:

![image](https://cloud.githubusercontent.com/assets/18525294/18038476/4bbbdea8-6dcb-11e6-8dd3-07cae45cdada.png)







[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

2016-08-28 Thread keypointt
Github user keypointt commented on a diff in the pull request:

https://github.com/apache/spark/pull/13584#discussion_r76543182
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -54,9 +54,6 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul
     intercept[IllegalArgumentException] {
       formula.fit(original)
     }
-    intercept[IllegalArgumentException] {
--- End diff --

This is just a duplicate of the block above.





[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

2016-08-28 Thread keypointt
Github user keypointt commented on a diff in the pull request:

https://github.com/apache/spark/pull/13584#discussion_r76543133
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/r/RWrapperUtilsSuite.scala ---
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.feature.{RFormula, RFormulaModel}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+
+class RWrapperUtilsSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  test("avoid column name conflicting") {
+    val rFormula = new RFormula().setFormula("label ~ features")
+    val data = spark.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt")
--- End diff --

Here I used `"../data/"`; I'm not sure if there is a better way to do it, something like `$current_directory/data/mllib/sample_libsvm_data.txt`?

All I found is something like `val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")`:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/NaiveBayesExample.scala#L36





[GitHub] spark pull request #14856: [SPARK-17241][SparkR][MLlib] SparkR spark.glm sho...

2016-08-28 Thread keypointt
GitHub user keypointt opened a pull request:

https://github.com/apache/spark/pull/14856

[SPARK-17241][SparkR][MLlib] SparkR spark.glm should have configurable 
regularization parameter

https://issues.apache.org/jira/browse/SPARK-17241

## What changes were proposed in this pull request?

Spark has a configurable L2 regularization parameter for generalized linear regression. It is very important to have it in SparkR so that users can run ridge regression.

## How was this patch tested?

Tested manually on a local laptop.




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/keypointt/spark SPARK-17241

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14856.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14856


commit 6417049e9185434bc23c651217d73a88abe4f606
Author: Xin Ren 
Date:   2016-08-28T23:01:37Z

[SPARK-17241] add configurable regularization parameter







[GitHub] spark pull request #14855: [SPARK-17284] [SQL] Remove Statistics-related Tab...

2016-08-28 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/14855

[SPARK-17284] [SQL] Remove Statistics-related Table Properties from SHOW 
CREATE TABLE

### What changes were proposed in this pull request?
The statistics-related table properties should be skipped by ```SHOW CREATE TABLE```, since they could be incorrect for the newly created table. See the Hive JIRA: https://issues.apache.org/jira/browse/HIVE-13792

```SQL
CREATE TABLE t1 (
  c1 INT COMMENT 'bla',
  c2 STRING
)
LOCATION '$dir'
TBLPROPERTIES (
  'prop1' = 'value1',
  'prop2' = 'value2'
)
```
The output of ```SHOW CREATE TABLE t1``` is 

```SQL
CREATE EXTERNAL TABLE `t1`(`c1` int COMMENT 'bla', `c2` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 
'file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-ee317538-0f8c-42d0-b08c-cf077d94fe75'
TBLPROPERTIES (
  'rawDataSize' = '-1',
  'numFiles' = '0',
  'transient_lastDdlTime' = '1472424052',
  'totalSize' = '0',
  'prop1' = 'value1',
  'prop2' = 'value2',
  'COLUMN_STATS_ACCURATE' = 'false',
  'numRows' = '-1'
)
```

After the fix, the output becomes
```SQL
CREATE EXTERNAL TABLE `t1`(`c1` int COMMENT 'bla', `c2` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 
'file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/spark-74058a6d-db8b-41c1-9bda-bd449f1a78ed'
TBLPROPERTIES (
  'transient_lastDdlTime' = '1472423603',
  'prop1' = 'value1',
  'prop2' = 'value2'
)
```

### How was this patch tested?
Updated the existing test cases.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark showCreateTable

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14855.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14855


commit 92474c5a142fb9db2c86549c8347f910fc01fcbd
Author: gatorsmile 
Date:   2016-08-28T22:28:15Z

remove stats-related props







[GitHub] spark issue #14741: [SPARK-6832][SPARKR][WIP]Handle partial reads in SparkR

2016-08-28 Thread krishnakalyan3
Github user krishnakalyan3 commented on the issue:

https://github.com/apache/spark/pull/14741
  
@shivaram thanks for the advice.

Some issues I am facing:
- While reading a large file from RStudio, I tried to kill the process (whose pid I obtained via `Sys.getpid()`) by sending it the signals `pskill(pid, signal = SIGUSR1/SIGCHLD)`. This does not seem to affect my R session and does not print `Interrupt` (as per my code below).
```
readBinFully <- function(con, what, n = 1L, size = NA_integer_, endian) {
  while (n > 0) {
    if (con == 0) {
      cat("Interrupt")
    }
    readBin(con, what, n, size, endian = "big")
  }
}
```
- The `sparkr.zip` obtained after running `install-dev.sh` does not seem to reflect the changes made in my R session (restarting RStudio solves this problem). Code below:
```
rm(list=ls())
Sys.setenv(SPARK_HOME="/Users/krishna/Experiment/spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
```




- To check whether there are more bytes to read, I have tried the code below, which fails the tests in `run-tests.sh`:
```
while (size > 0) {
...
}
```
I see that the variable `size` takes the value `NA`.

Please advise on how I should approach these issues.

Thanks





[GitHub] spark issue #14444: [SPARK-16839] [SQL] redundant aliases after cleanupAlias...

2016-08-28 Thread eyalfa
Github user eyalfa commented on the issue:

https://github.com/apache/spark/pull/14444
  
@hvanhovell, after a quick 'find references' on CreateStruct, it seems that there are many places where its constructor is used, but comparatively few places where it is pattern-matched.
I'm thinking about replacing the class with an object and an apply method that keeps compatibility with most existing code, then manually modifying the code that pattern-matches on CreateStruct (a small sketch follows).

Please let me know if this strategy is acceptable.
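
A hedged sketch of this strategy with stand-in types (the real Catalyst expressions are elided; this is an illustration, not the actual change):

```scala
// The case class becomes an object whose apply() returns the named variant,
// so plain constructor call sites keep compiling; only pattern matches on
// the old class need manual edits.
case class CreateNamedStruct(nameValuePairs: Seq[(String, String)])

object CreateStruct {
  // keeps `CreateStruct(children)` call sites source-compatible
  def apply(children: Seq[String]): CreateNamedStruct =
    CreateNamedStruct(children.zipWithIndex.map {
      case (child, idx) => (s"col${idx + 1}", child)
    })
}

object CreateStructDemo extends App {
  println(CreateStruct(Seq("a", "b")))
  // CreateNamedStruct(List((col1,a), (col2,b)))
}
```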





[GitHub] spark issue #14204: [SPARK-16520] [WEBUI] Link executors to corresponding wo...

2016-08-28 Thread nblintao
Github user nblintao commented on the issue:

https://github.com/apache/spark/pull/14204
  
test this, please





[GitHub] spark pull request #14854: [SPARK-17283][WIP][Core] Cancel job in RDD.take()...

2016-08-28 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14854#discussion_r76537638
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala ---
@@ -299,9 +299,17 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializ
   /**
    * Runs this query returning the first `n` rows as an array.
    *
-   * This is modeled after RDD.take but never runs any job locally on the driver.
--- End diff --

This comment is outdated now that we have removed driver-local execution.





[GitHub] spark pull request #14854: [SPARK-17283][WIP][Core] Cancel job in RDD.take()...

2016-08-28 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14854#discussion_r76537634
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -148,6 +148,12 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)
 
+  val ONLINE_TAKE_ENABLED = SQLConfigBuilder("spark.sql.onlineTake.enabled")
--- End diff --

This is the first name that came to mind for this feature-flag, but I'm 
sure there's a clearer alternative. Please suggest.





[GitHub] spark pull request #14854: [SPARK-17283][WIP][Core] Cancel job in RDD.take()...

2016-08-28 Thread JoshRosen
GitHub user JoshRosen opened a pull request:

https://github.com/apache/spark/pull/14854

[SPARK-17283][WIP][Core] Cancel job in RDD.take() as soon as enough output 
is received

## What changes were proposed in this pull request?

This patch introduces a new internal implementation of `RDD.take()` which 
halts the Spark job as soon as enough rows have been received. This can offer 
substantial performance improvements for interactive queries. For example, 
consider

```
sc.parallelize(1 to 10, 10).filter(_ > 1).take(1)
```

In the current `take()` implementation, this will end up computing almost 
every partition of the RDD even though only the first 10% of partitions must be 
computed to return the desired output. With this patch's changes, Spark will 
cancel the running job as soon as enough output has been received, greatly 
improving responsiveness for limit queries with highly-selective filters.

Although the changes here are implemented in Spark core, they also benefit 
Spark SQL; the following query also observes similar speedups:

```
sc.parallelize(1 to 10, 10).toDF.filter("value > 1").take(1)
```

**Caveat:** the performance optimizations implemented here do not apply to 
`AsyncRDDActions.takeAsync()`. The `takeAsync()` codepath is fairly complicated 
and, to the best of my knowledge, is not very widely used, so I'd like to 
postpone any changes to it.

This patch is marked as WIP pending tests and a few design details that I'd 
like to discuss (will comment inline).

### Implementation details

**Task result buffering:** this new implementation of `take()` returns 
exactly the same answers as the old implementation. The old implementation 
first collected the per-partition results from all partitions and then 
processed them in ascending order of partition id. In order to match this 
behavior, the new implementation carefully buffers task outputs in order to 
process them in order. The code is extensively commented to explain how its 
bookkeeping logic allows this to be done efficiently.
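
Below is a heavily simplified, conceptual sketch of the early-stopping idea, not the PR's actual code (the PR explicitly changes the DAGScheduler because today's `FutureAction.cancel()` marks the job as failed). The object name, the busy-wait, and the unordered buffer are all illustrative simplifications.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object EarlyStoppingTake {
  // Submit all partitions, buffer rows as they arrive, and cancel the job
  // once `num` rows have been received. The real implementation buffers by
  // partition id so results match the old take() exactly; that is elided.
  def take[T: ClassTag](sc: SparkContext, rdd: RDD[T], num: Int): Seq[T] = {
    val buf = new ArrayBuffer[T]
    val job = sc.submitJob[T, Array[T], Unit](
      rdd,
      iter => iter.take(num).toArray,  // partition-local limit
      0 until rdd.partitions.length,   // run every partition
      (_, rows) => buf.synchronized { buf ++= rows; () },
      ())
    // Busy-wait purely for illustration; production code should not poll.
    while (!job.isCompleted && buf.synchronized(buf.size) < num) {
      Thread.sleep(10)
    }
    if (!job.isCompleted) job.cancel()
    buf.synchronized(buf.take(num).toList)
  }
}
```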

**Scheduler changes:** the existing `FutureAction.cancel()` method causes 
the Spark scheduler to mark the task as failed, which doesn't make sense in 
this context. To address this, this patch modifies the DAGScheduler to allow 
jobs to be cancelled while still marking them as successful. This is a slightly 
tricky change that will require discussion and more tests (to be added here).

### Feature flags

I have made this new implementation the default but have preserved the old 
`take()` implementations behind feature-flags.

In Spark Core, the old `take()` implementation can be restored by setting 
`spark.driver.useOldTakeImplementation=true` in SparkConf.

In Spark SQL, the old implementation can be used by setting 
`spark.sql.onlineTake.enabled=false` in SQLConf.

## How was this patch tested?

TODO; tests pending. I plan to add the following tests:

- Ensure that the driver thread (the thread which calls `take()`) is 
interrupted / unblocked in case the user cancels the active `take()` job via 
the web UI or via a job group.
- Ensure that the console progress bar disappears after the job halts early.
- Add separate unit tests for the new "cancel without failing job" 
scheduler functionality.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/JoshRosen/spark take-early-stopping

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14854.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14854


commit 4c625788c3bd7c7c649937f95887653cd2cfc93f
Author: Josh Rosen 
Date:   2016-08-27T21:30:08Z

Initial commit of optimized take().

commit 0cd97eebeb0c6cb89fae850ef7e5299b1ddab480
Author: Josh Rosen 
Date:   2016-08-27T22:25:17Z

Fix negated feature flag.

commit ef5e9f73ea9cbe8aa71e781fc1cca0cb22957bc0
Author: Josh Rosen 
Date:   2016-08-28T18:08:49Z

Add scheduler path for canceling job without failing it.

commit 75640e2a08c41f571bacdd5fb7ef75e67eb62226
Author: Josh Rosen 
Date:   2016-08-28T18:44:48Z

Comment update.

commit 0751d776c19711e5c576ada8b46bfd79a5f50348
Author: Josh Rosen 
Date:   2016-08-28T19:16:16Z

Preserve partition-local limits in RDD take().





[GitHub] spark issue #14444: [SPARK-16839] [SQL] redundant aliases after cleanupAlias...

2016-08-28 Thread eyalfa
Github user eyalfa commented on the issue:

https://github.com/apache/spark/pull/14444
  
@hvanhovell, if you examine the diffs in this PR, you'll see that ExpressionEncoder [uses](https://github.com/apache/spark/pull/14444/files#diff-91c617f2464cea010922328f4cdbbda9R136) CreateStruct to encode tuples or case classes ('deep' cases, if I understand the exact use-case). These encoding expressions are sometimes evaluated without being put through the resolver (please see [this](https://github.com/apache/spark/pull/14444#issuecomment-241285908) comment).

Anyway, are you absolutely sure CreateStruct is not a public API? In case we actually remove it/turn it into a factory for the named version, I guess it's bye-bye to the newly introduced resolver rule, right?





[GitHub] spark issue #14842: [SPARK-10747][SQL] Support NULLS FIRST|LAST clause in OR...

2016-08-28 Thread xwu0226
Github user xwu0226 commented on the issue:

https://github.com/apache/spark/pull/14842
  
I see what you mean. Let me try your approach. Thank you!





[GitHub] spark pull request #14841: [SPARK-17271] [SQL] Planner adds un-necessary Sor...

2016-08-28 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14841





[GitHub] spark issue #14841: [SPARK-17271] [SQL] Planner adds un-necessary Sort even ...

2016-08-28 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/14841
  
LGTM - merging to master. Thanks!





[GitHub] spark pull request #14842: [SPARK-10747][SQL] Support NULLS FIRST|LAST claus...

2016-08-28 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14842#discussion_r76535329
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SortOrder.scala ---
@@ -36,11 +40,19 @@ case object Descending extends SortDirection {
   override def sql: String = "DESC"
 }
 
+case object NullFirst extends NullOrdering{
+  override def sql: String = "NULLS FIRST"
+}
+
+case object NullLast extends NullOrdering{
+  override def sql: String = "NULLS LAST"
+}
+
 /**
  * An expression that can be used to sort a tuple.  This class extends expression primarily so that
  * transformations over expression will descend into its child.
  */
-case class SortOrder(child: Expression, direction: SortDirection)
+case class SortOrder(child: Expression, direction: SortDirection, nullOrder: NullOrdering = null)
--- End diff --

Yeah, you are right about that. Let's keep the default behavior; see my (more general) comment.





[GitHub] spark issue #14842: [SPARK-10747][SQL] Support NULLS FIRST|LAST clause in OR...

2016-08-28 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/14842
  
Ok, so ASCENDING and DESCENDING have different NULLS behavior:
- ASCENDING: NULLS FIRST
- DESCENDING: NULLS LAST

I would really like to avoid `nulls` being passed around, and I would also like to avoid a third case we need to deal with. My suggestion (a minimal sketch follows) would be:

- Only deal with the ASC NULLS FIRST/ASC NULLS LAST/DESC NULLS FIRST/DESC NULLS LAST scenarios in prefix sort (and drop the null scenario).
- Encode the default null behavior in the `SortDirection` and make the default nullOrdering parameter in `SortOrder` use that.
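
A minimal sketch of that second bullet with stand-in types (the Catalyst `Expression` child is replaced by a `String`; this illustrates the suggestion, not the final code):

```scala
sealed trait NullOrdering { def sql: String }
case object NullsFirst extends NullOrdering { val sql = "NULLS FIRST" }
case object NullsLast extends NullOrdering { val sql = "NULLS LAST" }

// Each direction carries its own default null ordering.
sealed trait SortDirection { def defaultNullOrdering: NullOrdering }
case object Ascending extends SortDirection { val defaultNullOrdering = NullsFirst }
case object Descending extends SortDirection { val defaultNullOrdering = NullsLast }

case class SortOrder(child: String, direction: SortDirection, nullOrdering: NullOrdering)

object SortOrder {
  // No third "undefined" case: omitting the ordering picks the direction's default.
  def apply(child: String, direction: SortDirection): SortOrder =
    SortOrder(child, direction, direction.defaultNullOrdering)
}
```

With this shape, `SortOrder("a", Descending)` yields `NullsLast` without a `null` sentinel ever being passed around.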





[GitHub] spark issue #14618: [SPARK-17030] [SQL] Remove/Cleanup HiveMetastoreCatalog....

2016-08-28 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/14618
  
@cloud-fan Let me separate the cleanup task into multiple smaller PRs. We can decide which ones to merge first. Thanks!





[GitHub] spark pull request #14842: [SPARK-10747][SQL] Support NULLS FIRST|LAST claus...

2016-08-28 Thread xwu0226
Github user xwu0226 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14842#discussion_r76534922
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -2661,4 +2661,186 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
       data.selectExpr("`part.col1`", "`col.1`"))
     }
   }
+
+  test("SPARK-10747: ORDER BY clause with NULLS FIRST|LAST in WINDOW specification") {
--- End diff --

Yes. I will do that. Thanks!





[GitHub] spark issue #14712: [SPARK-17072] [SQL] support table-level statistics gener...

2016-08-28 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/14712
  
@wzhfy Yeah, it sounds good to me to split the whole problem into multiple 
PRs. 

@hvanhovell Sure, let me create the JIRA and I can work on this when the 
other dependent JIRAs are resolved.





[GitHub] spark pull request #14842: [SPARK-10747][SQL] Support NULLS FIRST|LAST claus...

2016-08-28 Thread xwu0226
Github user xwu0226 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14842#discussion_r76534903
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SortOrder.scala ---
@@ -36,11 +40,19 @@ case object Descending extends SortDirection {
   override def sql: String = "DESC"
 }
 
+case object NullFirst extends NullOrdering{
+  override def sql: String = "NULLS FIRST"
+}
+
+case object NullLast extends NullOrdering{
+  override def sql: String = "NULLS LAST"
+}
+
 /**
  * An expression that can be used to sort a tuple.  This class extends expression primarily so that
  * transformations over expression will descend into its child.
  */
-case class SortOrder(child: Expression, direction: SortDirection)
+case class SortOrder(child: Expression, direction: SortDirection, nullOrder: NullOrdering = null)
--- End diff --

Without giving a default value for this parameter, there are no more than about 100 lines of code to change in different places, including test cases. Setting "nulls first" as the default behavior may break many test cases that expect nulls first for ASC and nulls last for DESC; the original default behavior would change. If this is expected, I will make the change. Thanks!





[GitHub] spark issue #14842: [SPARK-10747][SQL] Support NULLS FIRST|LAST clause in OR...

2016-08-28 Thread xwu0226
Github user xwu0226 commented on the issue:

https://github.com/apache/spark/pull/14842
  
@hvanhovell Thank you so much for reviewing and providing the suggestions. 
I will separate this into 2 PRs. 
Regarding the question of why we need three types of NULL ordering: I was 
thinking that the default NULL ordering follows the column ordering ASC|DESC, 
where NULL is always considered smaller than non-null values, so nulls are 
returned first in the ASC case and last in the DESC case. In order not to break 
any existing test cases or behavior, I chose to keep an undefined nullOrder 
case that falls back to this existing behavior. What do you think?
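
To make that default concrete, here is a small illustration (a sketch only, 
assuming a Spark 2.x shell where a SparkSession named `spark` and its implicits 
are available; the data and table name are made up):

```
import spark.implicits._

// Hypothetical example data: one nullable integer column.
val df = Seq(Some(1), None, Some(3)).toDF("id")
df.createOrReplaceTempView("tbl_a")

// Default behavior: NULL compares as smaller than any non-null value,
// so it comes first under ASC and last under DESC.
spark.sql("SELECT id FROM tbl_a ORDER BY id ASC").show()   // null, 1, 3
spark.sql("SELECT id FROM tbl_a ORDER BY id DESC").show()  // 3, 1, null
```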





[GitHub] spark pull request #14842: [SPARK-10747][SQL] Support NULLS FIRST|LAST claus...

2016-08-28 Thread xwu0226
Github user xwu0226 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14842#discussion_r76534728
  
--- Diff: 
core/src/test/scala/org/apache/spark/util/collection/unsafe/sort/RadixSortSuite.scala
 ---
@@ -52,6 +52,7 @@ class RadixSortSuite extends SparkFunSuite with Logging {
   new PrefixComparators.RadixSortSupport {
--- End diff --

I will add test cases for this. 





[GitHub] spark issue #14712: [SPARK-17072] [SQL] support table-level statistics gener...

2016-08-28 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/14712
  
I like @hvanhovell 's proposal: based on @gatorsmile 's investigation, 
providing a perfect Hive translation layer is not trivial - we would need to 
deal with different versions of Hive. It is better to decouple that work from 
this PR. What do you think? @gatorsmile @cloud-fan 





[GitHub] spark issue #14807: [SPARK-17256][Deploy, Windows]Check before adding double...

2016-08-28 Thread tsudukim
Github user tsudukim commented on the issue:

https://github.com/apache/spark/pull/14807
  
Hi @qualiu, I had a quick look.
I believe spark-submit.cmd worked fine with a space in its path when #10789 
was merged, so I wonder whether the problem lies with `cmd /V /E /C`.





[GitHub] spark issue #14619: [SPARK-17031][SQL] Add `Scanner` operator to wrap the op...

2016-08-28 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/14619
  
@cloud-fan I've moved the `InsertRelationScanner` rule into the `Analyzer`, after 
relations and expressions are resolved. To reuse the analysis and optimization 
rules, I updated related rules such as `CleanupAliases`, `ColumnPruning`, 
`PushDownPredicate`, `InferFiltersFromConstraints`, `ConvertToLocalRelation`, 
and `PropagateEmptyRelation`, and I added new rules to combine and prune 
`Scanner` operators. I also made some changes to the subquery-related rules, 
and recently found that they have been refactored.
Only a few test cases are still failing now, and those should be easy to fix. 
However, I have realized that adding a wrapper node over every relation may not 
be an ideal approach (a rough sketch of the node follows below), for the 
following reasons:
Firstly, scanning a relation is not one of the basic operators of the SQL 
language: declaring a relation already implies that it will be scanned, so it 
seems semantically redundant to declare a `Scanner` node over a relation or to 
call `relation.scanner()`. Moreover, adding this wrapper node forces a new 
assumption that no other operator may be inserted between a `Scanner` and its 
corresponding relation, which brings in more complexity.
Secondly, the wrapper node has to carry the output, the predicates usable for 
partition pruning, and the relation to be scanned. This complicates some cases: 
for example, in `InferFiltersFromConstraints` we have to convert the 
expressions in the filters to their alias names when collecting valid 
constraints, because the output may be aliased while the filters must refer to 
the child expressions; no other operator needs this behavior.
Lastly, I feel that adding such an operator has caused too many changes; 
perhaps we should instead improve `PhysicalOperation` until we find an approach 
that is comprehensively better than the current method.

In any case, I'm passionate about this improvement and will try my best to 
contribute. Please correct me if I'm wrong, thank you!
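
For reference, here is a rough sketch of the kind of wrapper node discussed 
above; the name and fields are hypothetical and simplified, not the actual 
code in this PR:

```
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UnaryNode}

// A Scanner wraps exactly one relation and records what the scan must produce
// (the projection) and which predicates may be evaluated during the scan
// (e.g. for partition pruning).
case class Scanner(
    projectList: Seq[NamedExpression],
    filters: Seq[Expression],
    child: LogicalPlan)
  extends UnaryNode {
  override def output: Seq[Attribute] = projectList.map(_.toAttribute)
}
```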





[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...

2016-08-28 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14712#discussion_r76533896
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -401,6 +401,13 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
 }
   }
 
+  override def alterTableStats(tableDefinition: CatalogTable): Unit = 
withClient {
--- End diff --

See my previous comment. Lets first get this right on the Spark SQL side, 
and then work towards Hive compatibility.





[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...

2016-08-28 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14712#discussion_r76533832
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
 ---
@@ -207,9 +207,11 @@ class FindDataSourceTable(sparkSession: SparkSession) 
extends Rule[LogicalPlan]
 className = table.provider.get,
 options = table.storage.properties)
 
-LogicalRelation(
+val logicalRel = LogicalRelation(
   dataSource.resolveRelation(),
   metastoreTableIdentifier = Some(table.identifier))
+logicalRel.inheritedStats = table.catalogStats
--- End diff --

Shouldn't we just add all the table metadata to the `LogicalRelation`? Then we 
wouldn't have to assign the statistics separately.
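
A simplified sketch of that alternative (the field name is hypothetical): carry 
the full table metadata on the relation instead of copying one statistics value 
onto it after construction:

```
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LeafNode
import org.apache.spark.sql.sources.BaseRelation

// Hypothetical shape: the whole CatalogTable (statistics included) travels
// with the relation, so nothing has to be assigned afterwards.
case class LogicalRelation(
    relation: BaseRelation,
    catalogTable: Option[CatalogTable] = None) extends LeafNode {
  override def output: Seq[Attribute] = relation.schema.toAttributes
}
```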





[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...

2016-08-28 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14712#discussion_r76533586
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala
 ---
@@ -88,14 +87,66 @@ case class AnalyzeTableCommand(tableName: String) 
extends RunnableCommand {
 }
   }.getOrElse(0L)
 
-// Update the Hive metastore if the total size of the table is 
different than the size
-// recorded in the Hive metastore.
-// This logic is based on 
org.apache.hadoop.hive.ql.exec.StatsTask.aggregateStats().
-if (newTotalSize > 0 && newTotalSize != oldTotalSize) {
+var needUpdate = false
+val totalSize = if (newTotalSize > 0 && newTotalSize != 
oldTotalSize) {
+  needUpdate = true
+  newTotalSize
+} else {
+  oldTotalSize
+}
+var numRows: Option[BigInt] = None
+if (!noscan) {
+  val oldRowCount = 
catalogStats.flatMap(_.rowCount.map(_.toLong)).getOrElse(-1L)
+  val newRowCount = sparkSession.table(tableName).count()
--- End diff --

We can also use the passed plan, and skip a lookup in the spark session. 
`Dataset.ofRows(sparkSession, relation).count`
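
For illustration, the suggestion amounts to something like this (a sketch; note 
that `Dataset.ofRows` is package-private to `org.apache.spark.sql`, so the code 
has to live inside that package, as `AnalyzeTableCommand` does):

```
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Count rows via the already-resolved plan instead of re-resolving the
// table by name through the session catalog.
def countRows(sparkSession: SparkSession, relation: LogicalPlan): Long =
  Dataset.ofRows(sparkSession, relation).count()
```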





[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...

2016-08-28 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14712#discussion_r76533558
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala
 ---
@@ -88,14 +87,66 @@ case class AnalyzeTableCommand(tableName: String) 
extends RunnableCommand {
 }
   }.getOrElse(0L)
 
-// Update the Hive metastore if the total size of the table is 
different than the size
-// recorded in the Hive metastore.
-// This logic is based on 
org.apache.hadoop.hive.ql.exec.StatsTask.aggregateStats().
-if (newTotalSize > 0 && newTotalSize != oldTotalSize) {
+var needUpdate = false
+val totalSize = if (newTotalSize > 0 && newTotalSize != 
oldTotalSize) {
+  needUpdate = true
+  newTotalSize
+} else {
+  oldTotalSize
+}
+var numRows: Option[BigInt] = None
--- End diff --

Is it OK to remove the `numRows` statistics if we issue an `ANALYZE TABLE 
tbl_a NOSCAN`? 
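
For context, the two variants under discussion look like this (a sketch, 
assuming a SparkSession named `spark` and the Hive-style ANALYZE syntax used 
in this PR):

```
// Scans the table: both sizeInBytes and rowCount can be refreshed.
spark.sql("ANALYZE TABLE tbl_a COMPUTE STATISTICS")

// NOSCAN skips the scan: only sizeInBytes can be refreshed, which raises the
// question of what happens to a previously collected rowCount.
spark.sql("ANALYZE TABLE tbl_a COMPUTE STATISTICS NOSCAN")
```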





[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...

2016-08-28 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14712#discussion_r76533400
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala
 ---
@@ -88,14 +89,70 @@ case class AnalyzeTableCommand(tableName: String) 
extends RunnableCommand {
 }
   }.getOrElse(0L)
 
-// Update the Hive metastore if the total size of the table is 
different than the size
-// recorded in the Hive metastore.
-// This logic is based on 
org.apache.hadoop.hive.ql.exec.StatsTask.aggregateStats().
-if (newTotalSize > 0 && newTotalSize != oldTotalSize) {
+var needUpdate = false
+val totalSize = if (newTotalSize > 0 && newTotalSize != 
oldTotalSize) {
+  needUpdate = true
+  newTotalSize
+} else {
+  oldTotalSize
+}
+var numRows: Option[BigInt] = None
+if (!noscan) {
+  val oldRowCount: Long = if (catalogTable.catalogStats.isDefined) 
{
+
catalogTable.catalogStats.get.rowCount.map(_.toLong).getOrElse(-1L)
+  } else {
+-1L
+  }
+  val newRowCount = sparkSession.table(tableName).count()
+  if (newRowCount >= 0 && newRowCount != oldRowCount) {
+numRows = Some(BigInt(newRowCount))
+needUpdate = true
+  }
+}
+// Update the metastore if the above statistics of the table are 
different from those
+// recorded in the metastore.
+if (needUpdate) {
+  sessionState.catalog.alterTable(
+catalogTable.copy(
+  catalogStats = Some(Statistics(
+sizeInBytes = totalSize, rowCount = numRows))),
+fromAnalyze = true)
+
+  // Refresh the cache of the table in the catalog.
+  sessionState.catalog.refreshTable(tableIdent)
+}
+
+  // data source tables have been converted into LogicalRelations
+  case logicalRel: LogicalRelation if 
logicalRel.metastoreTableIdentifier.isDefined =>
--- End diff --

Ok, that is fair. Could you get rid of the duplicated code though?





[GitHub] spark issue #14840: [SPARK-16216][SQL][FOLLOWUP][BRANCH-2.0] Backport enabli...

2016-08-28 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/14840
  
Yes, thanks!





[GitHub] spark pull request #14840: [SPARK-16216][SQL][FOLLOWUP][BRANCH-2.0] Backport...

2016-08-28 Thread HyukjinKwon
Github user HyukjinKwon closed the pull request at:

https://github.com/apache/spark/pull/14840





[GitHub] spark issue #14712: [SPARK-17072] [SQL] support table-level statistics gener...

2016-08-28 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/14712
  
@gatorsmile lets also create another ticket for the explicit statistics 
updates. I do like the `alter table s update statistics set ...` option.
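
Sketching roughly what such a statement might look like; this syntax is only a 
proposal here, neither implemented nor settled:

```
// Hypothetical: set table-level statistics explicitly, without a scan.
spark.sql("ALTER TABLE s UPDATE STATISTICS SET sizeInBytes = 1000, rowCount = 10")
```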





[GitHub] spark issue #14712: [SPARK-17072] [SQL] support table-level statistics gener...

2016-08-28 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/14712
  
How about we use our own property names for now, and provide a Hive 
translation layer in a different PR? IMO it is fine to break a little bit of 
the behavior in master, as long as we fix it (or make a well-founded choice not 
to) before 2.1.





[GitHub] spark issue #14840: [SPARK-16216][SQL][FOLLOWUP][BRANCH-2.0] Backport enabli...

2016-08-28 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/14840
  
@HyukjinKwon could you close this PR (the merge script cannot do this).





[GitHub] spark issue #14840: [SPARK-16216][SQL][FOLLOWUP][BRANCH-2.0] Backport enabli...

2016-08-28 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/14840
  
LGTM. Merging to 2.0. Thanks!





[GitHub] spark pull request #14842: [SPARK-10747][SQL] Support NULLS FIRST|LAST claus...

2016-08-28 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14842#discussion_r76532925
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -2661,4 +2661,186 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
 data.selectExpr("`part.col1`", "`col.1`"))
 }
   }
+
+  test("SPARK-10747: ORDER BY clause with NULLS FIRST|LAST in WINDOW 
specification") {
--- End diff --

We also need a test for the generic case, e.g.: `SELECT * FROM tbl_a ORDER 
BY id NULLS LAST`
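
Something along these lines would cover it (a sketch inside a `QueryTest`-based 
suite, so `sql`, `checkAnswer`, `Row`, and the test implicits are in scope; the 
data and table name are made up):

```
// With an explicit NULLS LAST, nulls must sort after the non-null values
// even though ASC puts them first by default.
val df = Seq(Some(1), None, Some(3)).toDF("id")
df.createOrReplaceTempView("tbl_a")
checkAnswer(
  sql("SELECT * FROM tbl_a ORDER BY id NULLS LAST"),
  Seq(Row(1), Row(3), Row(null)))
```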





[GitHub] spark pull request #14842: [SPARK-10747][SQL] Support NULLS FIRST|LAST claus...

2016-08-28 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14842#discussion_r76532909
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -2661,4 +2661,186 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
 data.selectExpr("`part.col1`", "`col.1`"))
 }
   }
+
+  test("SPARK-10747: ORDER BY clause with NULLS FIRST|LAST in WINDOW 
specification") {
--- End diff --

Could you test this using SQLQueryTestSuite? That is much more concise.





[GitHub] spark pull request #14842: [SPARK-10747][SQL] Support NULLS FIRST|LAST claus...

2016-08-28 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14842#discussion_r76532670
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SortOrder.scala
 ---
@@ -58,7 +70,8 @@ case class SortOrder(child: Expression, direction: 
SortDirection)
   override def nullable: Boolean = child.nullable
 
   override def toString: String = s"$child ${direction.sql}"
-  override def sql: String = child.sql + " " + direction.sql
+  override def sql: String =
+child.sql + " " + direction.sql + s"${if (nullOrder!=null) " " + 
nullOrder.sql else ""}"
--- End diff --

string interpolation seems to add code here :)
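
One way to flatten it (a sketch, assuming `nullOrder` may still be null at 
this point):

```
override def sql: String =
  (Seq(child.sql, direction.sql) ++ Option(nullOrder).map(_.sql)).mkString(" ")
```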





[GitHub] spark pull request #14842: [SPARK-10747][SQL] Support NULLS FIRST|LAST claus...

2016-08-28 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14842#discussion_r76532654
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SortOrder.scala
 ---
@@ -36,11 +40,19 @@ case object Descending extends SortDirection {
   override def sql: String = "DESC"
 }
 
+case object NullFirst extends NullOrdering{
+  override def sql: String = "NULLS FIRST"
+}
+
+case object NullLast extends NullOrdering{
+  override def sql: String = "NULLS LAST"
+}
+
 /**
  * An expression that can be used to sort a tuple.  This class extends 
expression primarily so that
  * transformations over expression will descend into its child.
  */
-case class SortOrder(child: Expression, direction: SortDirection)
+case class SortOrder(child: Expression, direction: SortDirection, 
nullOrder: NullOrdering = null)
--- End diff --

Skip that. Let's not make this `null` and just provide the default nulls-first 
behavior. 
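
That is, something like the following signature (a sketch of the suggestion, 
reusing the types from the diff above):

```
// Default to nulls-first instead of null; call sites that omit the
// argument keep compiling.
case class SortOrder(
    child: Expression,
    direction: SortDirection,
    nullOrder: NullOrdering = NullFirst)
```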





[GitHub] spark pull request #14842: [SPARK-10747][SQL] Support NULLS FIRST|LAST claus...

2016-08-28 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14842#discussion_r76532409
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SortOrder.scala
 ---
@@ -36,11 +40,19 @@ case object Descending extends SortDirection {
   override def sql: String = "DESC"
 }
 
+case object NullFirst extends NullOrdering{
+  override def sql: String = "NULLS FIRST"
+}
+
+case object NullLast extends NullOrdering{
+  override def sql: String = "NULLS LAST"
+}
+
 /**
  * An expression that can be used to sort a tuple.  This class extends 
expression primarily so that
  * transformations over expression will descend into its child.
  */
-case class SortOrder(child: Expression, direction: SortDirection)
+case class SortOrder(child: Expression, direction: SortDirection, 
nullOrder: NullOrdering = null)
--- End diff --

I am not really against default parameters, but we typically do not do 
this. How many LOC would you have to change not to have a default parameter?





[GitHub] spark pull request #14842: [SPARK-10747][SQL] Support NULLS FIRST|LAST claus...

2016-08-28 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14842#discussion_r76532391
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SortOrder.scala
 ---
@@ -28,6 +28,10 @@ abstract sealed class SortDirection {
   def sql: String
 }
 
+abstract sealed class NullOrdering {
+def sql: String
--- End diff --

Nit: spacing





[GitHub] spark pull request #14842: [SPARK-10747][SQL] Support NULLS FIRST|LAST claus...

2016-08-28 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14842#discussion_r76532386
  
--- Diff: 
core/src/test/scala/org/apache/spark/util/collection/unsafe/sort/RadixSortSuite.scala
 ---
@@ -52,6 +52,7 @@ class RadixSortSuite extends SparkFunSuite with Logging {
   new PrefixComparators.RadixSortSupport {
--- End diff --

We need to add tests for nulls first and nulls last.





[GitHub] spark issue #14842: [SPARK-10747][SQL] Support NULLS FIRST|LAST clause in OR...

2016-08-28 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/14842
  
@xwu0226 I glanced over it and this looks like the right approach.

On a high level, I would break this up into two separate PRs: one to add 
`NULLS FIRST`/`NULLS LAST` sorting to the unsafe sorter (create a separate JIRA 
ticket for this), and a second one to integrate it into Spark SQL (I would 
repurpose this PR for that). This makes the PRs more targeted and easier to 
review.

I do have one question/comment regarding this PR. Why do we need to add 
three kinds of null ordering: unknown ordering, nulls first and nulls last? 
Can't we merge the unknown ordering with the default behavior?





[GitHub] spark issue #14444: [SPARK-16839] [SQL] redundant aliases after cleanupAlias...

2016-08-28 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/1
  
@eyalfa I would favor removing the `Create*Struct` classes altogether. 
Could you elaborate more on how this connects to `Encoders`?

@cloud-fan what is your take on this?





[GitHub] spark issue #9571: [SPARK-11373] [CORE] Add metrics to the History Server an...

2016-08-28 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/9571
  
Test failures are timeout-related and unlikely to be due to this patch:
```
Test Result (2 failures / +2)
org.apache.spark.sql.hive.HiveSparkSubmitSuite.dir
org.apache.spark.sql.hive.HiveSparkSubmitSuite.dir
```





[GitHub] spark issue #14691: [SPARK-16407][STREAMING] Allow users to supply custom st...

2016-08-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14691
  
**[Test build #64551 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64551/consoleFull)**
 for PR 14691 at commit 
[`c7bbffc`](https://github.com/apache/spark/commit/c7bbffcdbdc1723e165eec0bb481b1f385e18ac9).





[GitHub] spark issue #14746: [SPARK-17180] [SQL] Fix View Resolution Order in ALTER V...

2016-08-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14746
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64549/
Test PASSed.





[GitHub] spark issue #14746: [SPARK-17180] [SQL] Fix View Resolution Order in ALTER V...

2016-08-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14746
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14746: [SPARK-17180] [SQL] Fix View Resolution Order in ALTER V...

2016-08-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14746
  
**[Test build #64549 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64549/consoleFull)**
 for PR 14746 at commit 
[`5606344`](https://github.com/apache/spark/commit/5606344584c1c45cf49d2a95158c55193e777a9e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14746: [SPARK-17180] [SQL] Fix View Resolution Order in ALTER V...

2016-08-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14746
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64547/
Test PASSed.





[GitHub] spark issue #14746: [SPARK-17180] [SQL] Fix View Resolution Order in ALTER V...

2016-08-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14746
  
**[Test build #64547 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64547/consoleFull)**
 for PR 14746 at commit 
[`47fdf1f`](https://github.com/apache/spark/commit/47fdf1f217d4ea53e01827877620616be4d0efee).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14746: [SPARK-17180] [SQL] Fix View Resolution Order in ALTER V...

2016-08-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14746
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14853: [SparkR][Minor] Fix LDA doc

2016-08-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14853
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64550/
Test PASSed.





[GitHub] spark issue #14853: [SparkR][Minor] Fix LDA doc

2016-08-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14853
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14853: [SparkR][Minor] Fix LDA doc

2016-08-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14853
  
**[Test build #64550 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64550/consoleFull)**
 for PR 14853 at commit 
[`0a47e94`](https://github.com/apache/spark/commit/0a47e948d775116e275ad652cdba3c9c4adc).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14531: [SPARK-16943] [SPARK-16942] [SQL] Fix multiple bugs in C...

2016-08-28 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/14531
  
After reading the Hive source code, I found that the following six 
statistics-related table properties should be excluded: `numFiles`, 
`numPartitions`, `totalSize`, `numRows`, `rawDataSize`, and 
`COLUMN_STATS_ACCURATE`.
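
A minimal sketch of that exclusion (the helper name is hypothetical; it assumes 
the table properties are available as a `Map[String, String]`):

```
// Hive-managed statistics properties that Spark should not pass through.
val hiveStatsProperties = Set("numFiles", "numPartitions", "totalSize",
  "numRows", "rawDataSize", "COLUMN_STATS_ACCURATE")

def withoutHiveStats(props: Map[String, String]): Map[String, String] =
  props.filter { case (k, _) => !hiveStatsProperties.contains(k) }
```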





[GitHub] spark issue #14853: [SparkR][Minor] Fix LDA doc

2016-08-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14853
  
**[Test build #64550 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64550/consoleFull)**
 for PR 14853 at commit 
[`0a47e94`](https://github.com/apache/spark/commit/0a47e948d775116e275ad652cdba3c9c4adc).

