[GitHub] spark pull request #14452: [SPARK-16849][SQL] Improve subquery execution by ...

2016-09-01 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/14452#discussion_r77123756
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -89,6 +90,8 @@ abstract class Optimizer(sessionCatalog: SessionCatalog, conf: CatalystConf)
       CombineFilters,
       CombineLimits,
       CombineUnions,
+      // Pushdown Filters again after combination
--- End diff --

This change is submitted at #14912. Because it prevents this PR from pushing down some predicates, I also include that change here for ease of testing.
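For readers outside the PR, here is a standalone Spark 2.x sketch (not the PR's own test) of why pushdown needs another pass after the combining rules: a filter sitting above nested unions can only reach the leaves once `CombineUnions` has flattened them.

```scala
import org.apache.spark.sql.SparkSession

// Standalone illustration, assuming a local Spark 2.x session.
val spark = SparkSession.builder().master("local[1]").appName("pushdown-demo").getOrCreate()
import spark.implicits._

val a = Seq(1, 2, 3).toDF("x")
val b = Seq(4, 5, 6).toDF("x")
val c = Seq(7, 8, 9).toDF("x")

// Filter(x > 4, Union(Union(a, b), c)): after CombineUnions flattens the
// unions, the optimizer can push `x > 4` down onto each child.
val q = a.union(b).union(c).filter($"x" > 4)
q.explain(true) // the optimized plan shows the filter beneath the flattened Union

spark.stop()
```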





[GitHub] spark issue #14765: [SPARK-15815] Keeping tell yarn the target executors in ...

2016-09-01 Thread suyanNone
Github user suyanNone commented on the issue:

https://github.com/apache/spark/pull/14765
  
jenkins retest.





[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...

2016-09-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14712#discussion_r77124744
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala ---
@@ -590,8 +590,12 @@ class SQLBuilder private (

   object ExtractSQLTable {
     def unapply(plan: LogicalPlan): Option[SQLTable] = plan match {
-      case l @ LogicalRelation(_, _, Some(TableIdentifier(table, Some(database)))) =>
-        Some(SQLTable(database, table, l.output.map(_.withQualifier(None))))
+      case l @ LogicalRelation(_, _, Some(catalogTable))
+        if catalogTable.identifier.database.isDefined =>
--- End diff --

please add two more spaces before `if`
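For anyone skimming, this is the layout being requested (stand-in types, not the real Catalyst classes; only the extra indentation of the guard matters):

```scala
// Stand-in types for illustration; the point is the guard line indented
// two more spaces than in the diff above.
case class TableIdent(database: Option[String], table: String)
case class CatalogTable(identifier: TableIdent)

def databaseOf(relation: Option[CatalogTable]): Option[String] = relation match {
  case Some(catalogTable)
      if catalogTable.identifier.database.isDefined =>
    catalogTable.identifier.database
  case _ => None
}
```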





[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...

2016-09-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14452
  
**[Test build #64758 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64758/consoleFull)** for PR 14452 at commit [`e9b0952`](https://github.com/apache/spark/commit/e9b09527ca98b3f99b43be3a028f04a207422389).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14883: [SPARK-17319] [SQL] Move addJar from HiveSessionState to...

2016-09-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14883
  
**[Test build #64767 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64767/consoleFull)** for PR 14883 at commit [`ad37055`](https://github.com/apache/spark/commit/ad37055619b6ca278ff9f263229e5586273572c6).





[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...

2016-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14452
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...

2016-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14452
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64758/
Test PASSed.





[GitHub] spark pull request #14433: [SPARK-16829][SparkR]:sparkR sc.setLogLevel doesn...

2016-09-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14433#discussion_r77125115
  
--- Diff: core/src/main/scala/org/apache/spark/internal/Logging.scala ---
@@ -135,7 +136,12 @@ private[spark] trait Logging {
       val replLevel = Option(replLogger.getLevel()).getOrElse(Level.WARN)
       if (replLevel != rootLogger.getEffectiveLevel()) {
         System.err.printf("Setting default log level to \"%s\".\n", replLevel)
-        System.err.println("To adjust logging level use sc.setLogLevel(newLevel).")
+        if (SparkSubmit.isRShell) {
--- End diff --

I personally think it's a bit ugly to pipe this info through with extra methods and so on. It seems simpler just to amend the message to also state the right command for SparkR.
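A minimal sketch of that suggestion, assuming the message is simply extended (this is not the PR's code):

```scala
// One message covers both shells, so no shell-type flag has to be piped
// through SparkSubmit (sketch of the suggested alternative, not the patch).
System.err.println(
  "To adjust logging level use sc.setLogLevel(newLevel). " +
  "For SparkR, use setLogLevel(newLevel).")
```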





[GitHub] spark issue #14712: [SPARK-17072] [SQL] support table-level statistics gener...

2016-09-01 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/14712
  
Maybe I found another bug in the master branch?

When calculating statistics for data source tables, we do not exclude the staging directory. However, we do exclude it when `AnalyzeTableCommand` calculates the size. Since we convert Hive serde tables to data source tables, it sounds like we should also exclude the Hive staging directory, right?
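A hedged sketch of the exclusion being described, assuming a recursive size calculation; `sizeExcludingStaging` and the `.hive-staging` prefix are illustrative, not the actual `AnalyzeTableCommand` code:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Sum leaf-file sizes under `path`, skipping Hive staging directories.
def sizeExcludingStaging(fs: FileSystem, path: Path, stagingDir: String = ".hive-staging"): Long = {
  val status = fs.getFileStatus(path)
  if (status.isDirectory) {
    fs.listStatus(path)
      .filterNot(_.getPath.getName.startsWith(stagingDir)) // exclude staging dirs
      .map(s => sizeExcludingStaging(fs, s.getPath, stagingDir))
      .sum
  } else {
    status.getLen
  }
}
```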





[GitHub] spark pull request #14883: [SPARK-17319] [SQL] Move addJar from HiveSessionS...

2016-09-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14883#discussion_r77125955
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala ---
@@ -69,6 +71,29 @@ private[sql] class SharedState(val sparkContext: SparkContext) extends Logging {
   val jarClassLoader = new NonClosableMutableURLClassLoader(
     org.apache.spark.util.Utils.getContextOrSparkClassLoader)

+  /**
+   * Add a global-scoped jar
+   */
+  def addJar(path: String): Unit = {
+    if (sparkContext.conf.get(CATALOG_IMPLEMENTATION) == "hive") {
+      // Hive metastore supports custom serde.
+      externalCatalog.addJar(path)
--- End diff --

We call `addJar` only when the external catalog is the Hive metastore. This is consistent with our current implementation:

https://github.com/apache/spark/blob/261c55dd8808502fb7f3384eb537d26a4a8123d7/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala#L104
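Since the diff above is cut off, here is a hedged sketch of the full method shape it implies; the fall-through lines are assumptions, not the PR's exact code:

```scala
// Sketch only: `sparkContext`, `externalCatalog` and `jarClassLoader` come
// from the surrounding SharedState class shown in the diff.
def addJar(path: String): Unit = {
  if (sparkContext.conf.get(CATALOG_IMPLEMENTATION) == "hive") {
    // Hive metastore supports custom serde.
    externalCatalog.addJar(path)
  }
  sparkContext.addJar(path)                                  // ship the jar to executors
  jarClassLoader.addURL(new java.io.File(path).toURI.toURL)  // make it loadable on the driver
}
```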





[GitHub] spark issue #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14659
  
**[Test build #64757 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64757/consoleFull)** for PR 14659 at commit [`ae42093`](https://github.com/apache/spark/commit/ae42093e59e37d0a4fda4280f2bbffec18c594d3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14659
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14659
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64757/
Test PASSed.





[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

2016-09-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13584
  
**[Test build #64765 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64765/consoleFull)** for PR 13584 at commit [`1701252`](https://github.com/apache/spark/commit/1701252cf86a615874126215d956fd32d8eab0d0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

2016-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13584
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64765/
Test PASSed.





[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

2016-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13584
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #14868: [SPARK-16283][SQL] Implements percentile_approx a...

2016-09-01 Thread clockfly
Github user clockfly commented on a diff in the pull request:

https://github.com/apache/spark/pull/14868#discussion_r77126967
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala ---
@@ -0,0 +1,321 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import java.nio.ByteBuffer
+
+import com.google.common.primitives.{Doubles, Ints, Longs}
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{InternalRow}
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile.{PercentileDigest}
+import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
+import org.apache.spark.sql.catalyst.util.QuantileSummaries
+import org.apache.spark.sql.catalyst.util.QuantileSummaries.{defaultCompressThreshold, Stats}
+import org.apache.spark.sql.types._
+
+/**
+ * The ApproximatePercentile function returns the approximate percentile(s) of a column at the given
+ * percentage(s). A percentile is a watermark value below which a given percentage of the column
+ * values fall. For example, the percentile of column `col` at percentage 50% is the median of
+ * column `col`.
+ *
+ * This function supports partial aggregation.
+ *
+ * @param child child expression that can produce column value with `child.eval(inputRow)`
+ * @param percentageExpression Expression that represents a single percentage value or
+ *                             an array of percentage values. Each percentage value must be between
+ *                             0.0 and 1.0.
+ * @param accuracyExpression Integer literal expression of approximation accuracy. Higher value
+ *                           yields better accuracy, the default value is
+ *                           DEFAULT_PERCENTILE_ACCURACY.
+ */
+@ExpressionDescription(
+  usage =
+    """
+      _FUNC_(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
+      column `col` at the given percentage. The value of percentage must be between 0.0
+      and 1.0. The `accuracy` parameter (default: 1) is a positive integer literal which
+      controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
+      better accuracy, `1.0/accuracy` is the relative error of the approximation.
+
+      _FUNC_(col, array(percentage1 [, percentage2]...) [, accuracy]) - Returns the approximate
+      percentile array of column `col` at the given percentage array. Each value of the
+      percentage array must be between 0.0 and 1.0. The `accuracy` parameter (default: 1) is
+      a positive integer literal which controls approximation accuracy at the cost of memory.
+      Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the relative error of
+      the approximation.
+    """)
+case class ApproximatePercentile(
+    child: Expression,
+    percentageExpression: Expression,
+    accuracyExpression: Expression,
+    override val mutableAggBufferOffset: Int,
+    override val inputAggBufferOffset: Int) extends TypedImperativeAggregate[PercentileDigest] {
+
+  def this(child: Expression, percentageExpression: Expression, accuracyExpression: Expression) = {
+    this(child, percentageExpression, accuracyExpression, 0, 0)
+  }
+
+  def this(child: Expression, percentageExpression: Expression) = {
+    this(child, percentageExpression, Literal(ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))
+  }
+
+  // Mark as lazy so that accuracyExpression is not evaluated during tree transformation.
+  private lazy val
[GitHub] spark pull request #14868: [SPARK-16283][SQL] Implements percentile_approx a...

2016-09-01 Thread clockfly
Github user clockfly commented on a diff in the pull request:

https://github.com/apache/spark/pull/14868#discussion_r77127003
  

[GitHub] spark pull request #14913: [SPARK-17358][SQL] Cached table(parquet/orc) shou...

2016-09-01 Thread watermen
GitHub user watermen opened a pull request:

https://github.com/apache/spark/pull/14913

[SPARK-17358][SQL] Cached table (parquet/orc) should be shared between beelines

## What changes were proposed in this pull request?
A cached table (parquet/orc) couldn't be shared between beelines, because the `sameResult` method used by `CacheManager` always returns false when comparing two `HadoopFsRelation`s from different beelines. So I override `equals` and `hashCode` in `HadoopFsRelation`, comparing just the locations.
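A hedged sketch of that override (field names are assumptions, not necessarily the PR's exact code):

```scala
// Compare HadoopFsRelation instances by their root paths only, so
// CacheManager.sameResult can match plans built in different sessions.
override def equals(other: Any): Boolean = other match {
  case that: HadoopFsRelation =>
    this.location.rootPaths.toSet == that.location.rootPaths.toSet
  case _ => false
}

override def hashCode(): Int = location.rootPaths.toSet.hashCode()
```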

## How was this patch tested?
Beeline1
```
1: jdbc:hive2://localhost:1> cache table src_pqt;
+---------+--+
| Result  |
+---------+--+
+---------+--+
No rows selected (5.143 seconds)
1: jdbc:hive2://localhost:1> explain select * from src_pqt;
+----------------------------------------------------------------------------+--+
|                                    plan                                    |
+----------------------------------------------------------------------------+--+
| == Physical Plan ==
InMemoryTableScan [key#49, value#50]
   +- InMemoryRelation [key#49, value#50], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas), `src_pqt`
         +- *FileScan parquet default.src_pqt[key#0,value#1] Batched: true, Format: ParquetFormat, InputPaths: hdfs://199.0.0.1:9000/qiyadong/src_pqt, PartitionFilters: [], PushedFilters: [], ReadSchema: struct  |
+----------------------------------------------------------------------------+--+
```

Beeline2
```
0: jdbc:hive2://localhost:1> explain select * from src_pqt;
+----------------------------------------------------------------------------+--+
|                                    plan                                    |
+----------------------------------------------------------------------------+--+
| == Physical Plan ==
InMemoryTableScan [key#68, value#69]
   +- InMemoryRelation [key#68, value#69], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas), `src_pqt`
         +- *FileScan parquet default.src_pqt[key#0,value#1] Batched: true, Format: ParquetFormat, InputPaths: hdfs://199.0.0.1:9000/qiyadong/src_pqt, PartitionFilters: [], PushedFilters: [], ReadSchema: struct  |
+----------------------------------------------------------------------------+--+
```

[GitHub] spark pull request #14901: [SPARK-17347][SQL][Examples]Encoder in Dataset ex...

2016-09-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14901#discussion_r77127538
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala ---
@@ -203,7 +203,7 @@ object SparkSQLExample {
     // No pre-defined encoders for Dataset[Map[K,V]], define explicitly
     implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
     // Primitive types and case classes can be also defined as
-    implicit val stringIntMapEncoder: Encoder[Map[String, Int]] = ExpressionEncoder()
+    // implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder()
--- End diff --

This compiled before though, right? You're saying it's unnecessary because of the implicit in the line above? That seems fine, but let's delete these two lines then.





[GitHub] spark pull request #14914: [SPARK-17359][SQL][MLLib] Use ArrayBuffer.+=(A) i...

2016-09-01 Thread lw-lin
GitHub user lw-lin opened a pull request:

https://github.com/apache/spark/pull/14914

[SPARK-17359][SQL][MLLib] Use ArrayBuffer.+=(A) instead of 
ArrayBuffer.append(A) in performance critical paths

## What changes were proposed in this pull request?

We should generally use `ArrayBuffer.+=(A)` rather than 
`ArrayBuffer.append(A)`, because `append(A)` would involve extra boxing / 
unboxing.
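A quick standalone illustration of the difference (assuming Scala 2.11 collections, where `append` is a varargs method):

```scala
import scala.collection.mutable.ArrayBuffer

val buf = new ArrayBuffer[Int]()
buf += 1        // += takes a single element directly
buf.append(2)   // append(elems: Int*) first wraps its arguments in a Seq
assert(buf == ArrayBuffer(1, 2))
```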

## How was this patch tested?

N/A



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark append_to_plus_eq_v2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14914.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14914


commit fba1c5e430f90163cc10266812f2a0137d012be3
Author: Liwei Lin 
Date:   2016-08-15T02:56:58Z

append -> +=







[GitHub] spark issue #14914: [SPARK-17359][SQL][MLLib] Use ArrayBuffer.+=(A) instead ...

2016-09-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14914
  
**[Test build #64769 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64769/consoleFull)** for PR 14914 at commit [`fba1c5e`](https://github.com/apache/spark/commit/fba1c5e430f90163cc10266812f2a0137d012be3).





[GitHub] spark issue #14913: [SPARK-17358][SQL] Cached table(parquet/orc) should be s...

2016-09-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14913
  
**[Test build #64768 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64768/consoleFull)** for PR 14913 at commit [`fc93356`](https://github.com/apache/spark/commit/fc933563c1b5a9acc856c03ae4eba039d1f114bb).





[GitHub] spark pull request #14914: [SPARK-17359][SQL][MLLib] Use ArrayBuffer.+=(A) i...

2016-09-01 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14914#discussion_r77128029
  
--- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -999,7 +999,7 @@ object Matrices {
         val data = new ArrayBuffer[(Int, Int, Double)]()
         dnMat.foreachActive { (i, j, v) =>
           if (v != 0.0) {
-            data.append((i, j + startCol, v))
+            data.+=((i, j + startCol, v))
--- End diff --

Writing it as `data += (i, j + startCol, v)` would yield compilation errors:
```
Too many arguments for method +=(A)
Type mismatch, expected: (Int, Int, Double), actual: Int
```
thus it is written here as `data.+=((i, j + startCol, v))`.





[GitHub] spark issue #14911: [SPARK-17355] Workaround for HIVE-14684 / HiveResultSetM...

2016-09-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14911
  
**[Test build #64761 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64761/consoleFull)** for PR 14911 at commit [`6b56880`](https://github.com/apache/spark/commit/6b56880aa78a599fdf255d3668a848d9ad09691b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14911: [SPARK-17355] Workaround for HIVE-14684 / HiveResultSetM...

2016-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14911
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64761/
Test PASSed.





[GitHub] spark issue #14911: [SPARK-17355] Workaround for HIVE-14684 / HiveResultSetM...

2016-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14911
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14914: [SPARK-17359][SQL][MLLib] Use ArrayBuffer.+=(A) instead ...

2016-09-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14914
  
**[Test build #64770 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64770/consoleFull)** for PR 14914 at commit [`980a3a4`](https://github.com/apache/spark/commit/980a3a46e42330a04ab5ed66867123681806a4c0).





[GitHub] spark issue #14910: [SPARK-17271] [SQL] Remove redundant `semanticEquals()` ...

2016-09-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14910
  
**[Test build #64760 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64760/consoleFull)** for PR 14910 at commit [`56eb557`](https://github.com/apache/spark/commit/56eb55711581d68c9dbd6c01004f6f4cb45a7b6f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14910: [SPARK-17271] [SQL] Remove redundant `semanticEquals()` ...

2016-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14910
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64760/
Test PASSed.





[GitHub] spark issue #14910: [SPARK-17271] [SQL] Remove redundant `semanticEquals()` ...

2016-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14910
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14823: [SPARK-17257][SQL] the physical plan of CREATE TABLE or ...

2016-09-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14823
  
**[Test build #64762 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64762/consoleFull)** for PR 14823 at commit [`52a40d9`](https://github.com/apache/spark/commit/52a40d95a070466288161fee3b3985c94363b660).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14823: [SPARK-17257][SQL] the physical plan of CREATE TABLE or ...

2016-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14823
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64762/
Test PASSed.





[GitHub] spark issue #14823: [SPARK-17257][SQL] the physical plan of CREATE TABLE or ...

2016-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14823
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #14914: [SPARK-17359][SQL][MLLib] Use ArrayBuffer.+=(A) i...

2016-09-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14914#discussion_r77129894
  
--- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -999,7 +999,7 @@ object Matrices {
         val data = new ArrayBuffer[(Int, Int, Double)]()
         dnMat.foreachActive { (i, j, v) =>
           if (v != 0.0) {
-            data.append((i, j + startCol, v))
+            data.+=((i, j + startCol, v))
--- End diff --

You can write `data += ((i, j + startCol, v))`.
Yes, I think it may be worth optimizing this, because `append` actually expects varargs.
There are many more instances in the code, though. I think it's reasonable to fix this everywhere: at least in non-test code, and at least anywhere performance could be important.
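Spelled out (standalone snippet, not from the patch), the double parentheses make the tuple explicit, so `+=` receives exactly one argument:

```scala
import scala.collection.mutable.ArrayBuffer

val data = new ArrayBuffer[(Int, Int, Double)]()
data += ((1, 2, 3.0))     // compiles: appends the single tuple (1, 2, 3.0)
// data += (1, 2, 3.0)    // does not compile: parsed as three arguments to +=
data.append((1, 2, 3.0))  // also compiles, but append(elems: A*) allocates a Seq
```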





[GitHub] spark issue #14640: [SPARK-17055] [MLLIB] add labelKFold to CrossValidator

2016-09-01 Thread VinceShieh
Github user VinceShieh commented on the issue:

https://github.com/apache/spark/pull/14640
  
Updates:
1. Code refactoring: renamed the API to align with the sklearn changes.
2. Added the implementation in CrossValidator.





[GitHub] spark issue #14892: [SPARK-17329] [BUILD] Don't build PRs with -Pyarn unless...

2016-09-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14892
  
Merged to master





[GitHub] spark pull request #14892: [SPARK-17329] [BUILD] Don't build PRs with -Pyarn...

2016-09-01 Thread srowen
Github user srowen closed the pull request at:

https://github.com/apache/spark/pull/14892





[GitHub] spark pull request #14914: [SPARK-17359][SQL][MLLib] Use ArrayBuffer.+=(A) i...

2016-09-01 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14914#discussion_r77130880
  
--- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -999,7 +999,7 @@ object Matrices {
         val data = new ArrayBuffer[(Int, Int, Double)]()
         dnMat.foreachActive { (i, j, v) =>
           if (v != 0.0) {
-            data.append((i, j + startCol, v))
+            data.+=((i, j + startCol, v))
--- End diff --

I don't have a strong preference, but people might misinterpret `data += ((i, j + startCol, v))` as adding a tuple of a tuple?





[GitHub] spark issue #14891: [SQL][DOC][MINOR] Add (Scala-specific) and (Java-specifi...

2016-09-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14891
  
OK, fair enough. LGTM





[GitHub] spark pull request #14914: [SPARK-17359][SQL][MLLib] Use ArrayBuffer.+=(A) i...

2016-09-01 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14914#discussion_r77131280
  
--- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -999,7 +999,7 @@ object Matrices {
         val data = new ArrayBuffer[(Int, Int, Double)]()
         dnMat.foreachActive { (i, j, v) =>
           if (v != 0.0) {
-            data.append((i, j + startCol, v))
+            data.+=((i, j + startCol, v))
--- End diff --

> I think it's reasonable to fix this everywhere.

OK, let me do that; I was conservative and fixed only the instances that were in a loop.





[GitHub] spark issue #14531: [SPARK-17353] [SPARK-16943] [SPARK-16942] [SQL] Fix mult...

2016-09-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14531
  
**[Test build #64763 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64763/consoleFull)** for PR 14531 at commit [`4bcb306`](https://github.com/apache/spark/commit/4bcb306b3a3801e1bc76be14487e097c0c517b8f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14908: [WEBUI][SPARK-17352]Executor computing time can be negat...

2016-09-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14908
  
LGTM





[GitHub] spark issue #14531: [SPARK-17353] [SPARK-16943] [SPARK-16942] [SQL] Fix mult...

2016-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14531
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64763/
Test PASSed.





[GitHub] spark issue #14531: [SPARK-17353] [SPARK-16943] [SPARK-16942] [SQL] Fix mult...

2016-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14531
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14900: [WEBUI][SPARK-17342] Style of event timeline is broken

2016-09-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14900
  
LGTM. In light of this change, was 
https://github.com/apache/spark/pull/14791 necessary, or at least still a valid 
change?





[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-09-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r77131705
  
--- Diff: python/pyspark/mllib/feature.py ---
@@ -276,24 +276,64 @@ class ChiSqSelector(object):
 """
 Creates a ChiSquared feature selector.
 
-:param numTopFeatures: number of features that selector will select.
-
 >>> data = [
 ... LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
 ... LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
 ... LabeledPoint(1.0, [0.0, 9.0, 8.0]),
 ... LabeledPoint(2.0, [8.0, 9.0, 5.0])
 ... ]
->>> model = ChiSqSelector(1).fit(sc.parallelize(data))
+>>> model = 
ChiSqSelector().setNumTopFeatures(1).fit(sc.parallelize(data))
+>>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
+SparseVector(1, {0: 6.0})
+>>> model.transform(DenseVector([8.0, 9.0, 5.0]))
+DenseVector([5.0])
+>>> model = 
ChiSqSelector().setPercentile(0.34).fit(sc.parallelize(data))
 >>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
 SparseVector(1, {0: 6.0})
 >>> model.transform(DenseVector([8.0, 9.0, 5.0]))
 DenseVector([5.0])
+>>> data = [
+... LabeledPoint(0.0, SparseVector(4, {0: 8.0, 1: 7.0})),
+... LabeledPoint(1.0, SparseVector(4, {1: 9.0, 2: 6.0, 3: 4.0})),
+... LabeledPoint(1.0, [0.0, 9.0, 8.0, 4.0]),
+... LabeledPoint(2.0, [8.0, 9.0, 5.0, 9.0])
+... ]
+>>> model = ChiSqSelector().setAlpha(0.1).fit(sc.parallelize(data))
+>>> model.transform(DenseVector([1.0,2.0,3.0,4.0]))
+DenseVector([4.0])
 
 .. versionadded:: 1.4.0
 """
-def __init__(self, numTopFeatures):
-self.numTopFeatures = int(numTopFeatures)
+def __init__(self):
+self.param = 50
--- End diff --

Hm, at least comments, yes. But is it a problem to just use different fields for the different values? They don't even have the same type.
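For illustration, the shape being suggested, written here as a Scala sketch since the selector modes are separate, differently typed settings (names and defaults are assumptions, not the pyspark code):

```scala
// One typed field per selector mode instead of a single reused `param`.
class ChiSqSelectorParams {
  var selectorType: String = "numTopFeatures"
  var numTopFeatures: Int = 50  // Int: used by setNumTopFeatures
  var percentile: Double = 0.1  // Double in [0, 1]: used by setPercentile
  var alpha: Double = 0.05      // Double significance level: used by setAlpha
}
```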





[GitHub] spark issue #14863: [SPARK-16992][PYSPARK] use map comprehension in doc

2016-09-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14863
  
I don't know if performance is important here. I'd rather either batch this together with other changes so this change is made consistently, or drop this one.





[GitHub] spark issue #14567: [SPARK-16992][PYSPARK] Python Pep8 formatting and import...

2016-09-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14567
  
OK I'd leave these changes to Python people like @davies @holdenk @MLnick 
to comment on from here. I think style changes can be OK if they're consistent, 
enforceable, and moving the code towards more standard style.





[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r77132904
  
--- Diff: docs/ml-features.md ---
@@ -1102,7 +1102,8 @@ for more details on the API.
 ## QuantileDiscretizer
 
 `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
-categorical features. The number of bins is set by the `numBuckets` parameter.
+categorical features. The number of bins is set by the `numBuckets` parameter, but it is
--- End diff --

OK, but this doesn't specify the behavior. It should be explicit that while data will go into buckets 0 through numBuckets-1, NaN values will be counted in an extra bucket, numBuckets.
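Concretely, under the behavior srowen is asking to document (hedged; this assumes the PR's NaN handling is in place):

```scala
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("nan-bucket").getOrCreate()
import spark.implicits._

val df = Seq(1.0, 2.0, 3.0, 4.0, Double.NaN).toDF("hour")
val discretizer = new QuantileDiscretizer()
  .setInputCol("hour").setOutputCol("bucket").setNumBuckets(3)
// With numBuckets = 3, finite values land in buckets 0.0 through 2.0 and the
// NaN row would land in the extra bucket 3.0 (index == numBuckets).
discretizer.fit(df).transform(df).show()

spark.stop()
```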





[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r77133636
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -114,10 +115,10 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
 splits(0) = Double.NegativeInfinity
 splits(splits.length - 1) = Double.PositiveInfinity
 
-val distinctSplits = splits.distinct
+val distinctSplits = splits.filter(!_.isNaN).distinct
--- End diff --

Ah, I think this is a little at odds with the intent. We need to filter 
NaN out before the data goes to approxQuantile; NaNs should have no influence 
on the quantiles. Then there's no need to filter NaN out of the splits.

The message below can then remain unchanged from what it was before; that 
message is not related to the behavior of NaN.
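
A minimal sketch of that ordering, assuming a DataFrame `df`, an input column 
name, and the discretizer's `numBuckets`/`relativeError` parameters 
(`nanSafeSplits` is an illustrative name, not the merged patch):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def nanSafeSplits(
    df: DataFrame, inputCol: String, numBuckets: Int, relativeError: Double): Array[Double] = {
  // Drop NaNs up front so they have no influence on the quantiles.
  val cleaned = df.filter(!col(inputCol).isNaN)
  cleaned.stat.approxQuantile(
    inputCol, (0 to numBuckets).map(_.toDouble / numBuckets).toArray, relativeError)
}
```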





[GitHub] spark issue #14868: [SPARK-16283][SQL] Implements percentile_approx aggregat...

2016-09-01 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/14868
  
LGTM, merging to master!





[GitHub] spark issue #14877: fixed typos

2016-09-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14877
  
Merged to master





[GitHub] spark pull request #14868: [SPARK-16283][SQL] Implements percentile_approx a...

2016-09-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14868





[GitHub] spark pull request #14877: fixed typos

2016-09-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14877





[GitHub] spark pull request #14531: [SPARK-17353] [SPARK-16943] [SPARK-16942] [SQL] F...

2016-09-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14531





[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r77134887
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -114,10 +115,10 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
 splits(0) = Double.NegativeInfinity
 splits(splits.length - 1) = Double.PositiveInfinity
 
-val distinctSplits = splits.distinct
+val distinctSplits = splits.filter(!_.isNaN).distinct
--- End diff --

I didn't take this approach because it would increase the complexity, 
especially when the dataset is huge.





[GitHub] spark issue #14531: [SPARK-17353] [SPARK-16943] [SPARK-16942] [SQL] Fix mult...

2016-09-01 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/14531
  
merging to master! @gatorsmile can you send a new PR to backport it to 2.0? 
thanks!





[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r77135297
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -114,10 +115,10 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
 splits(0) = Double.NegativeInfinity
 splits(splits.length - 1) = Double.PositiveInfinity
 
-val distinctSplits = splits.distinct
+val distinctSplits = splits.filter(!_.isNaN).distinct
--- End diff --

It's a correctness issue though: you will compute the wrong splits. Why is 
it complex? Just filter the input, it seems.





[GitHub] spark pull request #14914: [SPARK-17359][SQL][MLLib] Use ArrayBuffer.+=(A) i...

2016-09-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14914#discussion_r77135576
  
--- Diff: 
mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -999,7 +999,7 @@ object Matrices {
 val data = new ArrayBuffer[(Int, Int, Double)]()
 dnMat.foreachActive { (i, j, v) =>
   if (v != 0.0) {
-data.append((i, j + startCol, v))
+data.+=((i, j + startCol, v))
--- End diff --

`()` can't create a tuple of one element, so it shouldn't be ambiguous at all. 
I think it's preferable to the slightly funny `.+=` syntax.





[GitHub] spark issue #14823: [SPARK-17257][SQL] the physical plan of CREATE TABLE or ...

2016-09-01 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/14823
  
thanks for the review, merging to master!





[GitHub] spark pull request #14823: [SPARK-17257][SQL] the physical plan of CREATE TA...

2016-09-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14823





[GitHub] spark issue #14910: [SPARK-17271] [SQL] Remove redundant `semanticEquals()` ...

2016-09-01 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/14910
  
thanks, merging to master!





[GitHub] spark pull request #14910: [SPARK-17271] [SQL] Remove redundant `semanticEqu...

2016-09-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14910





[GitHub] spark issue #14515: [SPARK-16926] [SQL] Remove partition columns from partit...

2016-09-01 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/14515
  
ok to test





[GitHub] spark pull request #14515: [SPARK-16926] [SQL] Remove partition columns from...

2016-09-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14515#discussion_r77136669
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---
@@ -162,7 +162,13 @@ private[hive] case class MetastoreRelation(
 
   val sd = new org.apache.hadoop.hive.metastore.api.StorageDescriptor()
   tPartition.setSd(sd)
-  sd.setCols(catalogTable.schema.map(toHiveColumn).asJava)
+
+  // Note: In Hive the schema and partition columns must be disjoint 
sets
+  val schema = catalogTable.schema.map(toHiveColumn).filter { c =>
+!catalogTable.partitionColumnNames.contains(c.getName)
--- End diff --

Ah, good catch! It would be better if we could have a test to prove that the 
unnecessary object inspector conversion is removed.





[GitHub] spark issue #14863: [SPARK-16992][PYSPARK] use map comprehension in doc

2016-09-01 Thread Stibbons
Github user Stibbons commented on the issue:

https://github.com/apache/spark/pull/14863
  
I agree. I would prefer that Spark examples also promote good Python 
practice, i.e., replacing `map` and `filter` with list or map comprehensions 
(`reduce` has no comprehension equivalent). Even though the `map`/`filter` 
syntax might look closer to its equivalent on RDDs, they are not the same. 
I am not sure there is a consensus on this point in the data science 
community, but most Pythonists now happily promote comprehensions over 
`map`/`filter`. Most of the time a comprehension is faster, especially when 
the result of `map` is then converted to a list. 
`map` may be faster than a comprehension when no lambda is used, and it is 
lazy on Python 3 (one can use a [generator 
comprehension](http://stackoverflow.com/questions/364802/generator-comprehension#answer-364818)
 on Python 2 or 3 to get the same laziness), so one should be aware of when to 
use each.

Long story short: if the Spark community agrees, I can look for these 
`map`/`filter` calls in the examples and replace them with comprehensions.





[GitHub] spark issue #14912: [SPARK-17357][SQL] Simplified predicates should be pushe...

2016-09-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14912
  
**[Test build #64766 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64766/consoleFull)**
 for PR 14912 at commit 
[`9e1c315`](https://github.com/apache/spark/commit/9e1c3159c0250bb921a83f923d5c9ebea1ffca42).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14912: [SPARK-17357][SQL] Simplified predicates should be pushe...

2016-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14912
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14912: [SPARK-17357][SQL] Simplified predicates should be pushe...

2016-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14912
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64766/
Test PASSed.





[GitHub] spark issue #14863: [SPARK-16992][PYSPARK] use map comprehension in doc

2016-09-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14863
  
OK, well, I'd leave it to people here with more taste to agree about what's 
canonical, but I take your word for it. I'm mostly interested in consistency, 
if anything.





[GitHub] spark issue #14515: [SPARK-16926] [SQL] Remove partition columns from partit...

2016-09-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14515
  
**[Test build #64771 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64771/consoleFull)**
 for PR 14515 at commit 
[`fd37123`](https://github.com/apache/spark/commit/fd3712360aa0fc05066b8c87b083f1b07fae762f).





[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...

2016-09-01 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/14712#discussion_r77137539
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala
 ---
@@ -52,7 +52,8 @@ case class LogicalRelation(
 
   // Logical Relations are distinct if they have different output for the 
sake of transformations.
   override def equals(other: Any): Boolean = other match {
-case l @ LogicalRelation(otherRelation, _, _) => relation == 
otherRelation && output == l.output
+case l @ LogicalRelation(otherRelation, _, _) =>
+  relation == otherRelation && output == l.output
--- End diff --

I'll recover it





[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-09-01 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r77137991
  
--- Diff: python/pyspark/mllib/feature.py ---
@@ -276,24 +276,64 @@ class ChiSqSelector(object):
 """
 Creates a ChiSquared feature selector.
 
-:param numTopFeatures: number of features that selector will select.
-
 >>> data = [
 ... LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
 ... LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
 ... LabeledPoint(1.0, [0.0, 9.0, 8.0]),
 ... LabeledPoint(2.0, [8.0, 9.0, 5.0])
 ... ]
->>> model = ChiSqSelector(1).fit(sc.parallelize(data))
+>>> model = 
ChiSqSelector().setNumTopFeatures(1).fit(sc.parallelize(data))
+>>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
+SparseVector(1, {0: 6.0})
+>>> model.transform(DenseVector([8.0, 9.0, 5.0]))
+DenseVector([5.0])
+>>> model = 
ChiSqSelector().setPercentile(0.34).fit(sc.parallelize(data))
 >>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
 SparseVector(1, {0: 6.0})
 >>> model.transform(DenseVector([8.0, 9.0, 5.0]))
 DenseVector([5.0])
+>>> data = [
+... LabeledPoint(0.0, SparseVector(4, {0: 8.0, 1: 7.0})),
+... LabeledPoint(1.0, SparseVector(4, {1: 9.0, 2: 6.0, 3: 4.0})),
+... LabeledPoint(1.0, [0.0, 9.0, 8.0, 4.0]),
+... LabeledPoint(2.0, [8.0, 9.0, 5.0, 9.0])
+... ]
+>>> model = ChiSqSelector().setAlpha(0.1).fit(sc.parallelize(data))
+>>> model.transform(DenseVector([1.0,2.0,3.0,4.0]))
+DenseVector([4.0])
 
 .. versionadded:: 1.4.0
 """
-def __init__(self, numTopFeatures):
-self.numTopFeatures = int(numTopFeatures)
+def __init__(self):
+self.param = 50
--- End diff --

Hi @srowen, using different fields for different values is not a problem; we 
just need another selectionType field, and in the fit function the code would be:
if self.selectionType == KBest:
   callMLlibFunc("fitChiSqSelectorKBest", self.numTopFeatures, data)
elif self.selectionType == Percentile:
   callMLlibFunc("fitChiSqSelectorKPercentile", self.percentile, data)
elif self.selectionType == FPR:
   callMLlibFunc("fitChiSqSelectorFPR", self.alpha, data)
Is that ok? Thanks.






[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r77138037
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -114,10 +115,10 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
 splits(0) = Double.NegativeInfinity
 splits(splits.length - 1) = Double.PositiveInfinity
 
-val distinctSplits = splits.distinct
+val distinctSplits = splits.filter(!_.isNaN).distinct
--- End diff --

Hmm, if we filter NaN before the data (of size m) goes to approxQuantile, the 
added cost would be O(m); filtering NaN in the splits would normally be 
cheaper, since the number of splits is far smaller than the size of the entire 
dataset.





[GitHub] spark pull request #14915: [SPARK-17356][SQL] Fix out of memory issue when c...

2016-09-01 Thread clockfly
GitHub user clockfly opened a pull request:

https://github.com/apache/spark/pull/14915

[SPARK-17356][SQL] Fix out of memory issue when calling TreeNode.toJSON

## What changes were proposed in this pull request?

The class `org.apache.spark.sql.types.Metadata` is widely used in MLlib to 
store ML attributes, and the metadata is commonly stored in `Alias` expressions. 
It can take a big memory footprint, since the number of 
attributes can be big (on the scale of millions). 

```
case class Alias(child: Expression, name: String)(
val exprId: ExprId = NamedExpression.newExprId,
val qualifier: Option[String] = None,
val explicitMetadata: Option[Metadata] = None,
override val isGenerated: java.lang.Boolean = false)
```

When `toJSON` is called on an `Alias` expression, the metadata is also 
converted to a big JSON string. 
If a plan contains many such `Alias` expressions, calling `toJSON` may trigger 
an out-of-memory error, since converting all the metadata 
references to JSON takes a huge amount of memory.
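
A rough illustration of the footprint (the array is a stand-in for ML 
attribute data; the size is made up):

```scala
import org.apache.spark.sql.types.MetadataBuilder

val big = new MetadataBuilder()
  .putDoubleArray("values", Array.fill(1000000)(1.0))
  .build()

// big.json serializes every entry; a plan with many Alias expressions
// carrying such metadata multiplies this cost when toJSON walks the tree.
val jsonLength = big.json.length
```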
 
With this PR, we will skip scanning Metadata when doing JSON conversion. 
For a reproducer and analysis, please look at jira 
https://issues.apache.org/jira/browse/SPARK-17356. 

## How was this patch tested?

Existing tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/clockfly/spark json_oom

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14915.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14915









[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r77138983
  
--- Diff: docs/ml-features.md ---
@@ -1102,7 +1102,8 @@ for more details on the API.
 ## QuantileDiscretizer
 
 `QuantileDiscretizer` takes a column with continuous features and outputs 
a column with binned
-categorical features. The number of bins is set by the `numBuckets` 
parameter.
+categorical features. The number of bins is set by the `numBuckets` 
parameter, but it is
--- End diff --

Consider the situation where a high proportion of duplicated data and/or NaN 
values exists in a data sample: the exact number of buckets is hard to 
determine, and it could be less than, equal to, or more than `numBuckets`. What 
we can be sure of is that NaN values, if present, will be grouped in the last bucket.





[GitHub] spark issue #14915: [SPARK-17356][SQL][WIP] Fix out of memory issue when gen...

2016-09-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14915
  
**[Test build #64772 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64772/consoleFull)**
 for PR 14915 at commit 
[`368e097`](https://github.com/apache/spark/commit/368e0971a4af43d065be854a231443cffa4f5769).





[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...

2016-09-01 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/14712#discussion_r77139470
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala
 ---
@@ -88,24 +85,53 @@ case class AnalyzeTableCommand(tableName: String) 
extends RunnableCommand {
 }
   }.getOrElse(0L)
 
-// Update the Hive metastore if the total size of the table is 
different than the size
-// recorded in the Hive metastore.
-// This logic is based on 
org.apache.hadoop.hive.ql.exec.StatsTask.aggregateStats().
-if (newTotalSize > 0 && newTotalSize != oldTotalSize) {
-  sessionState.catalog.alterTable(
-catalogTable.copy(
-  properties = relation.catalogTable.properties +
-(AnalyzeTableCommand.TOTAL_SIZE_FIELD -> 
newTotalSize.toString)))
-}
+updateTableStats(
+  catalogTable,
+  oldTotalSize = 
catalogTable.stats.map(_.sizeInBytes.toLong).getOrElse(0L),
+  oldRowCount = 
catalogTable.stats.flatMap(_.rowCount.map(_.toLong)).getOrElse(-1L),
+  newTotalSize = newTotalSize)
+
+  // data source tables have been converted into LogicalRelations
+  case logicalRel: LogicalRelation if 
logicalRel.catalogTable.isDefined =>
+updateTableStats(
+  logicalRel.catalogTable.get,
+  oldTotalSize = logicalRel.statistics.sizeInBytes.toLong,
+  oldRowCount = 
logicalRel.statistics.rowCount.map(_.toLong).getOrElse(-1L),
+  newTotalSize = logicalRel.relation.sizeInBytes)
--- End diff --

Yes, for the first analyze they are equal. But if there's a second 
analyze, they are not equal: since logicalRel is in cachedDataSourceTables, 
its statistics are still the original ones.
Anyway, there's a bug here. I think we can use catalogTable's statistics as 
the old ones, and I'll restore the test case to check size.





[GitHub] spark pull request #14915: [SPARK-17356][SQL][WIP] Fix out of memory issue w...

2016-09-01 Thread clockfly
Github user clockfly commented on a diff in the pull request:

https://github.com/apache/spark/pull/14915#discussion_r77139596
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala 
---
@@ -617,7 +618,9 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] 
extends Product {
 case s: String => JString(s)
 case u: UUID => JString(u.toString)
 case dt: DataType => dt.jsonValue
-case m: Metadata => m.jsonValue
+// SPARK-17356: In usage of mllib, Metadata may store a huge vector of 
data, transforming
--- End diff --

The current implementation of toJSON recursively searches Map and Seq values 
and tries to convert every field to JSON. 

That is quite risky: since we don't know what data is stored in an unknown Seq 
or Map, it may easily trigger an OOM if the collection is a huge object.

Maybe we should disable converting `Seq` and `Map`?
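
One standalone sketch of such a guard (the names and the size cutoff are 
illustrative assumptions, not the actual TreeNode code):

```scala
import org.json4s.JsonAST._

// Convert a value to JSON, but refuse to walk collections above a cutoff.
def toJsonValue(obj: Any): JValue = obj match {
  case i: Int => JInt(i)
  case s: String => JString(s)
  case seq: Seq[_] if seq.length > 10000 => JNull   // skip huge sequences
  case seq: Seq[_] => JArray(seq.map(toJsonValue).toList)
  case m: Map[_, _] if m.size > 10000 => JNull      // skip huge maps
  case m: Map[_, _] =>
    JObject(m.toList.map { case (k, v) => k.toString -> toJsonValue(v) })
  case _ => JNull                                   // unknown types: don't recurse
}
```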





[GitHub] spark issue #14712: [SPARK-17072] [SQL] support table-level statistics gener...

2016-09-01 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/14712
  
@gatorsmile Yes, we should exclude the staging dir.





[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

2016-09-01 Thread ScrapCodes
Github user ScrapCodes commented on the issue:

https://github.com/apache/spark/pull/14151
  
Thanks @gatorsmile. I was actually wondering where I can document this 
option.





[GitHub] spark issue #14915: [SPARK-17356][SQL][WIP] Fix out of memory issue when gen...

2016-09-01 Thread clockfly
Github user clockfly commented on the issue:

https://github.com/apache/spark/pull/14915
  
@mengxr  @yhuai, comments?






[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

2016-09-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14151
  
**[Test build #64773 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64773/consoleFull)**
 for PR 14151 at commit 
[`8ac37c1`](https://github.com/apache/spark/commit/8ac37c1b774046efe39173e4e8fa91c0feb68f49).





[GitHub] spark issue #14873: [SPARK-17308]Improved the spark core code by replacing a...

2016-09-01 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/14873
  
From my understanding, this is more a personal preference than a code style 
issue. We may change the code for now, but how can we guarantee that other 
people won't use pattern matching in the future? So IMO, if we want to fix this, 
it would be better to add it as a style-check rule, though personally I 
don't think this change is really necessary.





[GitHub] spark issue #14873: [SPARK-17308]Improved the spark core code by replacing a...

2016-09-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14873
  
Although I'm slightly positive on it, I would not merge if there are two 
slightly negative reviews. I think it's a bit more than style preference, but 
not much more. Is there ever a benefit to pattern-matching a boolean?





[GitHub] spark issue #14883: [SPARK-17319] [SQL] Move addJar from HiveSessionState to...

2016-09-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14883
  
**[Test build #64767 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64767/consoleFull)**
 for PR 14883 at commit 
[`ad37055`](https://github.com/apache/spark/commit/ad37055619b6ca278ff9f263229e5586273572c6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14883: [SPARK-17319] [SQL] Move addJar from HiveSessionState to...

2016-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14883
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14883: [SPARK-17319] [SQL] Move addJar from HiveSessionState to...

2016-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14883
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64767/
Test PASSed.





[GitHub] spark pull request #14842: [SPARK-10747][SQL] Support NULLS FIRST|LAST claus...

2016-09-01 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14842#discussion_r77141646
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SortOrder.scala
 ---
@@ -26,21 +26,40 @@ import 
org.apache.spark.util.collection.unsafe.sort.PrefixComparators.DoublePref
 
 abstract sealed class SortDirection {
   def sql: String
+  def defaultNullOrdering: NullOrdering
+}
+
+abstract sealed class NullOrdering {
+  def sql: String
 }
 
 case object Ascending extends SortDirection {
   override def sql: String = "ASC"
+  override def defaultNullOrdering: NullOrdering = NullFirst
 }
 
+// default null order is last for desc
 case object Descending extends SortDirection {
   override def sql: String = "DESC"
+  override def defaultNullOrdering: NullOrdering = NullLast
+}
+
+case object NullFirst extends NullOrdering{
+  override def sql: String = "NULLS FIRST"
+}
+
+case object NullLast extends NullOrdering{
+  override def sql: String = "NULLS LAST"
 }
 
 /**
  * An expression that can be used to sort a tuple.  This class extends 
expression primarily so that
  * transformations over expression will descend into its child.
  */
-case class SortOrder(child: Expression, direction: SortDirection)
+case class SortOrder(
--- End diff --

Update the sql and toString methods.
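
One possible shape of that update, sketched outside Catalyst with `Expression` 
stood in by a plain String (an assumption, not the merged code):

```scala
sealed trait NullOrdering { def sql: String }
case object NullFirst extends NullOrdering { def sql: String = "NULLS FIRST" }
case object NullLast  extends NullOrdering { def sql: String = "NULLS LAST" }

// sql and toString now include the null ordering alongside the direction.
case class SortOrder(child: String, directionSql: String, nullOrdering: NullOrdering) {
  def sql: String = s"$child $directionSql ${nullOrdering.sql}"
  override def toString: String = sql
}

SortOrder("a", "DESC", NullFirst).sql  // "a DESC NULLS FIRST"
```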





[GitHub] spark pull request #14914: [SPARK-17359][SQL][MLLib] Use ArrayBuffer.+=(A) i...

2016-09-01 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14914#discussion_r77141688
  
--- Diff: 
mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -999,7 +999,7 @@ object Matrices {
 val data = new ArrayBuffer[(Int, Int, Double)]()
 dnMat.foreachActive { (i, j, v) =>
   if (v != 0.0) {
-data.append((i, j + startCol, v))
+data.+=((i, j + startCol, v))
--- End diff --

You can do

```
data += Tuple3(i, j + startCol, v)
```

probably the most clear
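
For reference, the spellings under discussion side by side, in a standalone 
snippet outside Spark:

```scala
import scala.collection.mutable.ArrayBuffer

val data = new ArrayBuffer[(Int, Int, Double)]()
data.append((1, 2, 3.0))     // original form
data.+=((1, 2, 3.0))         // the PR's dotted form
data += Tuple3(1, 2, 3.0)    // the suggestion above
data += ((1, 2, 3.0))        // infix also works with doubled parentheses
```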





[GitHub] spark pull request #14858: [SPARK-17219][ML] Add NaN value handling in Bucke...

2016-09-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14858#discussion_r77141804
  
--- Diff: docs/ml-features.md ---
@@ -1102,7 +1102,8 @@ for more details on the API.
 ## QuantileDiscretizer
 
 `QuantileDiscretizer` takes a column with continuous features and outputs 
a column with binned
-categorical features. The number of bins is set by the `numBuckets` 
parameter.
+categorical features. The number of bins is set by the `numBuckets` 
parameter, but it is
--- End diff --

It's always possible to have less data than buckets. The problem here is 
that you might even have enough non-NaN data to properly determine distinct 
buckets, but fail to do so because NaNs make some splits NaN. You'd end up 
with fewer splits than intended when you could have created all the meaningful 
splits.
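
A toy illustration with hand-written split values (not the discretizer code):

```scala
// 4 buckets were requested, so 5 boundaries, but two came back NaN because
// NaNs leaked into the quantile computation. Dropping them leaves 3
// boundaries, i.e. only 2 buckets instead of the intended 4.
val splits = Array(Double.NegativeInfinity, 1.0, Double.NaN, Double.NaN,
  Double.PositiveInfinity)
val numBuckets = splits.filter(!_.isNaN).distinct.length - 1  // 2
```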





[GitHub] spark pull request #14842: [SPARK-10747][SQL] Support NULLS FIRST|LAST claus...

2016-09-01 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14842#discussion_r77141840
  
--- Diff: 
sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 ---
@@ -324,7 +324,7 @@ queryPrimary
 ;
 
 sortItem
-: expression ordering=(ASC | DESC)?
+: expression ordering=(ASC | DESC)? (NULLS nullOrder=(LAST | FIRST))?
--- End diff --

Are you allowed to write `ORDER BY a NULLS FIRST`?





[GitHub] spark pull request #14842: [SPARK-10747][SQL] Support NULLS FIRST|LAST claus...

2016-09-01 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14842#discussion_r77142212
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
 ---
@@ -1204,9 +1204,29 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] 
with Logging {
*/
   override def visitSortItem(ctx: SortItemContext): SortOrder = 
withOrigin(ctx) {
 if (ctx.DESC != null) {
-  SortOrder(expression(ctx.expression), Descending)
+  if (ctx.nullOrder != null) {
+ctx.nullOrder.getType match {
+  case SqlBaseParser.FIRST =>
+SortOrder(expression(ctx.expression), Descending, NullFirst)
+  case SqlBaseParser.LAST =>
+SortOrder(expression(ctx.expression), Descending)
+  case _ => throw new ParseException(s"NULL ordering can be only 
FIRST or LAST", ctx)
--- End diff --

The parser guarantees this.





[GitHub] spark issue #14873: [SPARK-17308]Improved the spark core code by replacing a...

2016-09-01 Thread shiv4nsh
Github user shiv4nsh commented on the issue:

https://github.com/apache/spark/pull/14873
  
@jerryshao: It is always better not to use pattern matching on a boolean, 
AFAIK, and it reduces the bytecode too. You can take a look here: 
http://stackoverflow.com/questions/9266822/pattern-matching-vs-if-else where 
Rex Kerr has explained it in more detail!
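
For reference, the two styles side by side (illustrative only):

```scala
def doA(): Unit = println("A")
def doB(): Unit = println("B")
val flag = true

// pattern matching on a Boolean...
flag match {
  case true  => doA()
  case false => doB()
}

// ...versus the plain conditional preferred in the linked answer
if (flag) doA() else doB()
```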





[GitHub] spark pull request #14842: [SPARK-10747][SQL] Support NULLS FIRST|LAST claus...

2016-09-01 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14842#discussion_r77142563
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
 ---
@@ -1204,9 +1204,29 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] 
with Logging {
*/
   override def visitSortItem(ctx: SortItemContext): SortOrder = 
withOrigin(ctx) {
 if (ctx.DESC != null) {
--- End diff --

the code below can be a lot shorter.
```scala
val direction = if (ctx.DESC != null) {
  Descending
} else {
  Ascending
}
val nullOrdering = if (ctx.FIRST != null) {
  NullFirst
} else if (ctx.LAST != null) {
  NullLast
} else {
  direction.defaultNullOrdering
}
SortOrder(expression(ctx.expression), direction, nullOrdering)
```





[GitHub] spark pull request #14842: [SPARK-10747][SQL] Support NULLS FIRST|LAST claus...

2016-09-01 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14842#discussion_r77143025
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SortPrefixUtils.scala ---
@@ -40,29 +40,64 @@ object SortPrefixUtils {
 
   def getPrefixComparator(sortOrder: SortOrder): PrefixComparator = {
 sortOrder.dataType match {
-  case StringType =>
-if (sortOrder.isAscending) PrefixComparators.STRING else 
PrefixComparators.STRING_DESC
-  case BinaryType =>
-if (sortOrder.isAscending) PrefixComparators.BINARY else 
PrefixComparators.BINARY_DESC
+  case StringType => getPrefixComparatorWithNullOrder(sortOrder, 
"STRING")
+  case BinaryType => getPrefixComparatorWithNullOrder(sortOrder, 
"BINARY")
   case BooleanType | ByteType | ShortType | IntegerType | LongType | 
DateType | TimestampType =>
-if (sortOrder.isAscending) PrefixComparators.LONG else 
PrefixComparators.LONG_DESC
+getPrefixComparatorWithNullOrder(sortOrder, "LONG")
   case dt: DecimalType if dt.precision - dt.scale <= 
Decimal.MAX_LONG_DIGITS =>
-if (sortOrder.isAscending) PrefixComparators.LONG else 
PrefixComparators.LONG_DESC
-  case FloatType | DoubleType =>
-if (sortOrder.isAscending) PrefixComparators.DOUBLE else 
PrefixComparators.DOUBLE_DESC
-  case dt: DecimalType =>
-if (sortOrder.isAscending) PrefixComparators.DOUBLE else 
PrefixComparators.DOUBLE_DESC
+getPrefixComparatorWithNullOrder(sortOrder, "LONG")
+  case FloatType | DoubleType => 
getPrefixComparatorWithNullOrder(sortOrder, "DOUBLE")
+  case dt: DecimalType => getPrefixComparatorWithNullOrder(sortOrder, 
"DOUBLE")
   case _ => NoOpPrefixComparator
 }
   }
 
+  private def getPrefixComparatorWithNullOrder(
+ sortOrder: SortOrder, signedType: String): PrefixComparator = {
+sortOrder.direction match {
+  case Ascending if (sortOrder.nullOrdering == NullLast) =>
+signedType match {
+  case "LONG" => PrefixComparators.LONG_NULLLAST
+  case "STRING" => PrefixComparators.STRING_NULLLAST
+  case "BINARY" => PrefixComparators.BINARY_NULLLAST
+  case "DOUBLE" => PrefixComparators.DOUBLE_NULLLAST
+}
+  case Ascending =>
+// or the default NULLS FIRST
+signedType match {
+  case "LONG" => PrefixComparators.LONG
+  case "STRING" => PrefixComparators.STRING
+  case "BINARY" => PrefixComparators.BINARY
+  case "DOUBLE" => PrefixComparators.DOUBLE
+}
+  case Descending if (sortOrder.nullOrdering == NullFirst) =>
+signedType match {
+  case "LONG" => PrefixComparators.LONG_DESC_NULLFIRST
+  case "STRING" => PrefixComparators.STRING_DESC_NULLFIRST
+  case "BINARY" => PrefixComparators.BINARY_DESC_NULLFIRST
+  case "DOUBLE" => PrefixComparators.DOUBLE_DESC_NULLFIRST
+}
+  case Descending =>
+// or the default NULLS LAST
+signedType match {
+  case "LONG" => PrefixComparators.LONG_DESC
+  case "STRING" => PrefixComparators.STRING_DESC
+  case "BINARY" => PrefixComparators.BINARY_DESC
+  case "DOUBLE" => PrefixComparators.DOUBLE_DESC
+}
+  case _ => throw new IllegalArgumentException(
+"This should not happen. Contact Spark contributors for this 
error.")
+}
+  }
+
   /**
* Creates the prefix comparator for the first field in the given 
schema, in ascending order.
*/
   def getPrefixComparator(schema: StructType): PrefixComparator = {
 if (schema.nonEmpty) {
   val field = schema.head
-  getPrefixComparator(SortOrder(BoundReference(0, field.dataType, 
field.nullable), Ascending))
+  getPrefixComparator(
--- End diff --

revert this.





[GitHub] spark pull request #14842: [SPARK-10747][SQL] Support NULLS FIRST|LAST claus...

2016-09-01 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14842#discussion_r77143047
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SortPrefixUtils.scala ---
@@ -89,7 +124,8 @@ object SortPrefixUtils {
* Returns whether the fully sorting on the specified key field is 
possible with radix sort.
*/
   def canSortFullyWithPrefix(field: StructField): Boolean = {
-canSortFullyWithPrefix(SortOrder(BoundReference(0, field.dataType, 
field.nullable), Ascending))
+canSortFullyWithPrefix(
--- End diff --

revert this.





[GitHub] spark issue #14913: [SPARK-17358][SQL] Cached table(parquet/orc) should be s...

2016-09-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14913
  
**[Test build #64768 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64768/consoleFull)**
 for PR 14913 at commit 
[`fc93356`](https://github.com/apache/spark/commit/fc933563c1b5a9acc856c03ae4eba039d1f114bb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14913: [SPARK-17358][SQL] Cached table(parquet/orc) should be s...

2016-09-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14913
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64768/
Test PASSed.




