[GitHub] spark pull request #18804: [SPARK-21599][SQL] Collecting column statistics f...

2017-08-01 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18804#discussion_r130798893
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -642,8 +642,13 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
   if (stats.get.rowCount.isDefined) {
 statsProperties += STATISTICS_NUM_ROWS -> 
stats.get.rowCount.get.toString()
   }
+
+  // For datasource tables and hive serde tables created by spark 2.1 
or higher,
--- End diff --

Also add a test for hive serde tables?
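
A rough sketch of the kind of test being suggested (this assumes SQLTestUtils/TestHiveSingleton-style helpers such as `withTable` and `sql`, and is not the actual test added in this PR):

```scala
test("SPARK-21599: column statistics are collected for hive serde tables") {
  withTable("hive_serde_tbl") {
    sql("CREATE TABLE hive_serde_tbl (key INT, value STRING) STORED AS TEXTFILE")
    sql("INSERT INTO hive_serde_tbl VALUES (1, 'a'), (2, 'b')")
    sql("ANALYZE TABLE hive_serde_tbl COMPUTE STATISTICS FOR COLUMNS key, value")

    val meta = spark.sessionState.catalog
      .getTableMetadata(org.apache.spark.sql.catalyst.TableIdentifier("hive_serde_tbl"))
    // Row count and per-column stats should survive the round trip through the Hive metastore.
    assert(meta.stats.flatMap(_.rowCount) === Some(BigInt(2)))
    assert(meta.stats.exists(_.colStats.nonEmpty))
  }
}
```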


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties from s...

2017-08-01 Thread yaooqinn
Github user yaooqinn commented on the issue:

https://github.com/apache/spark/pull/18668
  
@vanzin 
> the configuration of the execution Hive

Does this mean a Hive client initialized by [HiveUtils.newClientForExecution](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L241)? If so, it is ONLY used in [HiveThriftServer2](https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L88) after the SparkContext is initialized.

Example:
`sbin/start-thriftserver.sh --conf spark.hadoop.hive.server2.thrift.port=11001 --hiveconf hive.server2.thrift.port=11000`

The Spark Thrift Server will take 11001 as the port. `hive.server2.thrift.port` is first parsed as `11000`, but it is rewritten to `11001` at [HiveThriftServer2.scala#L90](https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L90) after SparkSQLEnv initializes the SparkContext. IMO the `spark.hadoop.xxx` properties can be treated as special Spark properties, which should have higher priority than their original form `xxx`. SparkSQLCLIDriver should obey this rule too.
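
A minimal sketch of the priority rule being argued for (the helper below is hypothetical, not Spark's actual code path): entries of the form `spark.hadoop.xxx` are applied to the Hadoop/Hive configuration last, so they win over a bare `xxx` passed via `--hiveconf`.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// Hypothetical helper illustrating the proposed priority, not Spark's real implementation.
def applySparkHadoopProps(sparkConf: SparkConf, hadoopConf: Configuration): Unit = {
  sparkConf.getAll
    .filter { case (key, _) => key.startsWith("spark.hadoop.") }
    .foreach { case (key, value) =>
      // Overrides any value previously set from --hiveconf for the same key.
      hadoopConf.set(key.stripPrefix("spark.hadoop."), value)
    }
}
```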







[GitHub] spark issue #18812: [SPARK-21606][SQL]HiveThriftServer2 catches OOMs on requ...

2017-08-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18812
  
Can one of the admins verify this patch?





[GitHub] spark pull request #18812: [SPARK-21606][SQL]HiveThriftServer2 catches OOMs ...

2017-08-01 Thread zuotingbing
GitHub user zuotingbing opened a pull request:

https://github.com/apache/spark/pull/18812

[SPARK-21606][SQL]HiveThriftServer2 catches OOMs on request threads

## What changes were proposed in this pull request?

Referring to Hive, ThriftCLIService methods such as ExecuteStatement are apparently capable of catching OOMs because they get wrapped in an RTE by HiveSessionProxy. This PR fixes the same bug in Spark.
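
A rough sketch of the pattern described above (the method and handling here are illustrative, not the exact change in this PR): because HiveSessionProxy rethrows failures wrapped in a RuntimeException, a request thread has to unwrap the cause to react to an OutOfMemoryError.

```scala
// Sketch only: illustrative handler, not the actual HiveThriftServer2 code.
def runRequest(body: => Unit): Unit = {
  try {
    body
  } catch {
    case rte: RuntimeException if rte.getCause.isInstanceOf[OutOfMemoryError] =>
      // React to the OOM (e.g. fail the request and shut down cleanly)
      // instead of letting it be swallowed as an ordinary runtime failure.
      throw rte.getCause
  }
}
```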

## How was this patch tested?

Existing tests and manual tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zuotingbing/spark OOM_HiveThriftServer2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18812.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18812


commit 759330df5685a6f162d9c9666db03b08148b5ba9
Author: zuotingbing 
Date:   2017-08-02T06:44:08Z

[SPARK-21606][SQL]HiveThriftServer2 catches OOMs on request threads







[GitHub] spark issue #18808: [SPARK-21605][HOT-FIX][BUILD] Let IntelliJ IDEA correctl...

2017-08-01 Thread wangyum
Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/18808
  
Jenkins, test this please





[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Tim...

2017-08-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18664#discussion_r130795748
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3036,6 +3052,9 @@ def test_toPandas_arrow_toggle(self):
 pdf = df.toPandas()
 self.spark.conf.set("spark.sql.execution.arrow.enable", "true")
--- End diff --

Shouldn't we set it back after switching it to `true`?





[GitHub] spark issue #18808: [SPARK-21605][HOT-FIX][BUILD] Let IntelliJ IDEA correctl...

2017-08-01 Thread gslowikowski
Github user gslowikowski commented on the issue:

https://github.com/apache/spark/pull/18808
  
I removed too much in #18750. Eclipse uses these two configuration 
parameters of `maven-compiler-plugin` too.





[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18798#discussion_r130794138
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -0,0 +1,633 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import java.io._
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
+import org.apache.spark.sql.Column
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Expression, 
UnsafeArrayData}
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, 
Complete, TypedImperativeAggregate}
+import org.apache.spark.sql.catalyst.util.ArrayData
+import org.apache.spark.sql.functions.lit
+import org.apache.spark.sql.types._
+
+/**
+ * A builder object that provides summary statistics about a given column.
+ *
+ * Users should not directly create such builders, but instead use one of 
the methods in
+ * [[Summarizer]].
+ */
+@Since("2.2.0")
+abstract class SummaryBuilder {
+  /**
+   * Returns an aggregate object that contains the summary of the column 
with the requested metrics.
+   * @param featuresCol a column that contains features Vector object.
+   * @param weightCol a column that contains weight value.
+   * @return an aggregate column that contains the statistics. The exact 
content of this
+   * structure is determined during the creation of the builder.
+   */
+  @Since("2.2.0")
+  def summary(featuresCol: Column, weightCol: Column): Column
+
+  @Since("2.2.0")
+  def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0))
+}
+
+/**
+ * Tools for vectorized statistics on MLlib Vectors.
+ *
+ * The methods in this package provide various statistics for Vectors 
contained inside DataFrames.
+ *
+ * This class lets users pick the statistics they would like to extract 
for a given column. Here is
+ * an example in Scala:
+ * {{{
+ *   val dataframe = ... // Some dataframe containing a feature column
+ *   val allStats = dataframe.select(Summarizer.metrics("min", 
"max").summary($"features"))
+ *   val Row(min_, max_) = allStats.first()
+ * }}}
+ *
+ * If one wants to get a single metric, shortcuts are also available:
+ * {{{
+ *   val meanDF = dataframe.select(Summarizer.mean($"features"))
+ *   val Row(mean_) = meanDF.first()
+ * }}}
+ */
+@Since("2.2.0")
+object Summarizer extends Logging {
+
+  import SummaryBuilderImpl._
+
+  /**
+   * Given a list of metrics, provides a builder that in turn computes metrics from a column.
+   *
+   * See the documentation of [[Summarizer]] for an example.
+   *
+   * The following metrics are accepted (case sensitive):
+   *  - mean: a vector that contains the coefficient-wise mean.
+   *  - variance: a vector that contains the coefficient-wise variance.
+   *  - count: the count of all vectors seen.
+   *  - numNonzeros: a vector with the number of non-zeros for each coefficient.
+   *  - max: the maximum for each coefficient.
+   *  - min: the minimum for each coefficient.
+   *  - normL2: the Euclidean norm for each coefficient.
+   *  - normL1: the L1 norm of each coefficient (sum of the absolute values).
+   * @param firstMetric the metric being provided
+   * @param metrics additional metrics that can be provided.
+   * @return a builder.
+   * @throws IllegalArgumentException if one of the metric names is not 
understood.
+   */
+  @Since("2.2.0")
+  def metrics(firstMetric: String, metrics: String*): SummaryBuilder = {
+val (typedMetrics, computeMetrics) = 
getRelevantMetrics(Seq(firstMetric) ++ metrics)
+new SummaryBuilderImpl(typedMetrics, computeMetrics)
+  }
+
+  def mean(col: Co

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18798#discussion_r130794916
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/stat/SummarizerSuite.scala ---
@@ -0,0 +1,619 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import org.scalatest.exceptions.TestFailedException
+
+import org.apache.spark.{SparkException, SparkFunSuite}
+import org.apache.spark.ml.linalg.{Vector, Vectors}
+import org.apache.spark.ml.util.TestingUtils._
+import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => 
OldVectors}
+import org.apache.spark.mllib.stat.{MultivariateOnlineSummarizer, 
Statistics}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
+
+class SummarizerSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  import testImplicits._
+  import Summarizer._
+  import SummaryBuilderImpl._
+
+  private case class ExpectedMetrics(
+  mean: Seq[Double],
+  variance: Seq[Double],
+  count: Long,
+  numNonZeros: Seq[Long],
+  max: Seq[Double],
+  min: Seq[Double],
+  normL2: Seq[Double],
+  normL1: Seq[Double])
+
+  // The input is expected to be either a sparse vector, a dense vector or 
an array of doubles
+  // (which will be converted to a dense vector)
+  // The expected is the list of all the known metrics.
+  //
+  // The tests take a list of input vectors and a list of all the summary values that
+  // are expected for this input. They currently test against some fixed 
subset of the
+  // metrics, but should be made fuzzy in the future.
+
+  private def testExample(name: String, input: Seq[Any], exp: 
ExpectedMetrics): Unit = {
+def inputVec: Seq[Vector] = input.map {
+  case x: Array[Double @unchecked] => Vectors.dense(x)
+  case x: Seq[Double @unchecked] => Vectors.dense(x.toArray)
+  case x: Vector => x
+  case x => throw new Exception(x.toString)
+}
+
+val s = {
+  val s2 = new MultivariateOnlineSummarizer
+  inputVec.foreach(v => s2.add(OldVectors.fromML(v)))
+  s2
+}
+
+// Because the Spark context is reset between tests, we cannot hold a 
reference onto it.
+def wrapped() = {
+  val df = sc.parallelize(inputVec).map(Tuple1.apply).toDF("features")
+  val c = df.col("features")
+  (df, c)
+}
+
+registerTest(s"$name - mean only") {
+  val (df, c) = wrapped()
+  compare(df.select(metrics("mean").summary(c), mean(c)), 
Seq(Row(exp.mean), s.mean))
+}
+
+registerTest(s"$name - mean only (direct)") {
+  val (df, c) = wrapped()
+  compare(df.select(mean(c)), Seq(exp.mean))
+}
+
+registerTest(s"$name - variance only") {
+  val (df, c) = wrapped()
+  compare(df.select(metrics("variance").summary(c), variance(c)),
+Seq(Row(exp.variance), s.variance))
+}
+
+registerTest(s"$name - variance only (direct)") {
+  val (df, c) = wrapped()
+  compare(df.select(variance(c)), Seq(s.variance))
+}
+
+registerTest(s"$name - count only") {
+  val (df, c) = wrapped()
+  compare(df.select(metrics("count").summary(c), count(c)),
+Seq(Row(exp.count), exp.count))
+}
+
+registerTest(s"$name - count only (direct)") {
+  val (df, c) = wrapped()
+  compare(df.select(count(c)),
+Seq(exp.count))
+}
+
+registerTest(s"$name - numNonZeros only") {
+  val (df, c) = wrapped()
+  compare(df.select(metrics("numNonZeros").summary(c), numNonZeros(c)),
+Seq(Row(exp.numNonZeros), exp.numNonZeros))
+}
+
+registerTest(s"$name - numNonZeros only (direct)") {

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18798#discussion_r130794375
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
(Quoted diff identical to the Summarizer.scala hunk above; truncated in the archive.)

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18798#discussion_r130792887
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
(Quoted diff identical to the Summarizer.scala hunk above; truncated in the archive.)

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18798#discussion_r130793985
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
(Quoted diff identical to the Summarizer.scala hunk above; truncated in the archive.)

[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...

2017-08-01 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18798#discussion_r130793859
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
(Quoted diff identical to the Summarizer.scala hunk above; truncated in the archive.)

[GitHub] spark pull request #18619: [SPARK-21397][BUILD]Maven shade plugin adding dep...

2017-08-01 Thread zuotingbing
Github user zuotingbing closed the pull request at:

https://github.com/apache/spark/pull/18619





[GitHub] spark issue #18106: [SPARK-20754][SQL] Support TRUNC (number)

2017-08-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18106
  
**[Test build #80150 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80150/testReport)**
 for PR 18106 at commit 
[`3d40c36`](https://github.com/apache/spark/commit/3d40c366892303cd0de8259b31aebe7a748d89e6).





[GitHub] spark issue #18695: [SPARK-12717][PYTHON] Adding thread-safe broadcast pickl...

2017-08-01 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/18695
  
@HyukjinKwon that needs to be added separately by someone who has access to 
Jenkins as admin






[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Tim...

2017-08-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18664#discussion_r13079
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala 
---
@@ -42,6 +43,9 @@ object ArrowUtils {
 case StringType => ArrowType.Utf8.INSTANCE
 case BinaryType => ArrowType.Binary.INSTANCE
 case DecimalType.Fixed(precision, scale) => new 
ArrowType.Decimal(precision, scale)
+case DateType => new ArrowType.Date(DateUnit.DAY)
+case TimestampType =>
+  new ArrowType.Timestamp(TimeUnit.MICROSECOND, 
DateTimeUtils.defaultTimeZone().getID)
--- End diff --

This is wrong, right?





[GitHub] spark issue #18804: [SPARK-21599][SQL] Collecting column statistics for data...

2017-08-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18804
  
**[Test build #80149 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80149/testReport)**
 for PR 18804 at commit 
[`420be2f`](https://github.com/apache/spark/commit/420be2f28db5f413566c161aa7969db664cd8f3b).





[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...

2017-08-01 Thread yaooqinn
Github user yaooqinn commented on a diff in the pull request:

https://github.com/apache/spark/pull/18668#discussion_r130792745
  
--- Diff: 
sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala
 ---
@@ -283,4 +283,17 @@ class CliSuite extends SparkFunSuite with 
BeforeAndAfterAll with Logging {
   "SET conf3;" -> "conftest"
 )
   }
+
+  test("SPARK-21451: spark.sql.warehouse.dir should respect options in 
--hiveconf") {
+runCliWithin(1.minute)("set spark.sql.warehouse.dir;" -> 
warehousePath.getAbsolutePath)
+  }
+
+  test("SPARK-21451: Apply spark.hadoop.* configurations") {
--- End diff --

Yes, after the SparkContext is initialized, `spark.hadoop.hive.metastore.warehouse.dir` is translated into the Hadoop conf `hive.metastore.warehouse.dir` as an alternative warehouse dir. This test case can't tell whether this PR works; CliSuite may not see these values unless we explicitly set them in SQLConf.

The original code did break another test case anyway.





[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Tim...

2017-08-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18664#discussion_r130792754
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -3092,7 +3092,8 @@ class Dataset[T] private[sql](
 val maxRecordsPerBatch = 
sparkSession.sessionState.conf.arrowMaxRecordsPerBatch
 queryExecution.toRdd.mapPartitionsInternal { iter =>
   val context = TaskContext.get()
-  ArrowConverters.toPayloadIterator(iter, schemaCaptured, 
maxRecordsPerBatch, context)
+  ArrowConverters.toPayloadIterator(
+iter, schemaCaptured, maxRecordsPerBatch, context)
--- End diff --

Revert this back?





[GitHub] spark issue #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties from s...

2017-08-01 Thread yaooqinn
Github user yaooqinn commented on the issue:

https://github.com/apache/spark/pull/18668
  
There is a bug in HiveClientImpl about reusing cliSessionState; see [HiveClientImpl.scala#L140](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L140):

> // In `SparkSQLCLIDriver`, we have already started a `CliSessionState`,
> // which contains information like configurations from command line. Later
> // we call `SparkSQLEnv.init()` there, which would run into this part again.
> // so we should keep `conf` and reuse the existing instance of `CliSessionState`

Actually, that branch is never reached and the state is never reused. `session.SessionState` is regenerated every time `HiveClient.newSession()` is called.

If you run `bin/spark-sql --master local` with INFO logging on, you can see [HiveClientImpl.scala#L193](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L193) called four times, creating session-related directories:
1. [HiveExternalCatalog.scala#L65](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L65)
2. [HiveSessionStateBuilder.scala#L45](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala#L45)
3. [SparkSQLEnv.scala#L54](https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala#L54) - which is unnecessary, I guess
4. [SparkSQLCLIDriver.scala#L115](https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala#L115) - which should be reused, and it has access to all Hadoop configurations





[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-08-01 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18805
  
How big is the dependency that's getting pulled in? If we are adding more 
compression codecs maybe we should retire some old ones, or move them into a 
separate package so downstream apps can optionally depend on them.






[GitHub] spark issue #18809: [SPARK-21602][R] Add map_keys and map_values functions t...

2017-08-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18809
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80147/
Test PASSed.





[GitHub] spark issue #18809: [SPARK-21602][R] Add map_keys and map_values functions t...

2017-08-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18809
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18809: [SPARK-21602][R] Add map_keys and map_values functions t...

2017-08-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18809
  
**[Test build #80147 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80147/testReport)**
 for PR 18809 at commit 
[`d87f4c4`](https://github.com/apache/spark/commit/d87f4c4a63067aba8d1dc4228fda5d94bd5c830c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #18806: [SPARK-21600] The description of "this requires s...

2017-08-01 Thread guoxiaolongzte
Github user guoxiaolongzte commented on a diff in the pull request:

https://github.com/apache/spark/pull/18806#discussion_r130788660
  
--- Diff: docs/configuration.md ---
@@ -1638,7 +1638,7 @@ Apart from these, the following properties are also 
available, and may be useful
 For more detail, see the description
 here.
 
-This requires spark.shuffle.service.enabled to be set.
+This requires spark.shuffle.service.enabled to be set 
true.
--- End diff --

Thank you for your comments.
"This requires spark.shuffle.service.enabled to be set true" is very clear. Only such an accurate description leaves no ambiguity.






[GitHub] spark issue #18808: [SPARK-21605][HOT-FIX][BUILD] Let IntelliJ IDEA correctl...

2017-08-01 Thread baibaichen
Github user baibaichen commented on the issue:

https://github.com/apache/spark/pull/18808
  
https://issues.apache.org/jira/browse/SPARK-21605 is added





[GitHub] spark issue #18555: [SPARK-21353][CORE]add checkValue in spark.internal.conf...

2017-08-01 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18555
  
Thanks! @heary-cao 

cc @jiangxb1987 Could you take a look to ensure no behavior change will be 
caused by this PR?





[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-08-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18805
  
**[Test build #80148 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80148/testReport)**
 for PR 18805 at commit 
[`295f38a`](https://github.com/apache/spark/commit/295f38a808dfdbbba94a83a21708b0597327d195).





[GitHub] spark issue #18811: [SPARK-21604][SQL]Error class name for log, and if the o...

2017-08-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18811
  
Can one of the admins verify this patch?





[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-08-01 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r130787262
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -50,13 +51,14 @@ private[spark] object CompressionCodec {
 
   private[spark] def supportsConcatenationOfSerializedStreams(codec: 
CompressionCodec): Boolean = {
 (codec.isInstanceOf[SnappyCompressionCodec] || 
codec.isInstanceOf[LZFCompressionCodec]
-  || codec.isInstanceOf[LZ4CompressionCodec])
+  || codec.isInstanceOf[LZ4CompressionCodec] || 
codec.isInstanceOf[ZStandardCompressionCodec])
   }
 
   private val shortCompressionCodecNames = Map(
 "lz4" -> classOf[LZ4CompressionCodec].getName,
 "lzf" -> classOf[LZFCompressionCodec].getName,
-"snappy" -> classOf[SnappyCompressionCodec].getName)
+"snappy" -> classOf[SnappyCompressionCodec].getName,
+"zstd" -> classOf[SnappyCompressionCodec].getName)
--- End diff --

Ah, my bad. Fixed it.
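
For reference, the corrected entry presumably points at the ZStandard codec rather than Snappy (a sketch of the fixed map):

```scala
  private val shortCompressionCodecNames = Map(
    "lz4" -> classOf[LZ4CompressionCodec].getName,
    "lzf" -> classOf[LZFCompressionCodec].getName,
    "snappy" -> classOf[SnappyCompressionCodec].getName,
    // The earlier revision mistakenly mapped "zstd" to SnappyCompressionCodec.
    "zstd" -> classOf[ZStandardCompressionCodec].getName)
```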





[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-08-01 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r130787287
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -216,3 +218,30 @@ private final class SnappyOutputStreamWrapper(os: 
SnappyOutputStream) extends Ou
 }
   }
 }
+
+/**
+ * :: DeveloperApi ::
+ * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]].
+ *
+ * @note The wire protocol for this codec is not guaranteed to be 
compatible across versions
+ * of Spark. This is intended for use as an internal compression utility 
within a single Spark
+ * application.
+ */
+@DeveloperApi
+class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec {
+
+  override def compressedOutputStream(s: OutputStream): OutputStream = {
+val level = 
conf.getSizeAsBytes("spark.io.compression.zstandard.level", "1").toInt
--- End diff --

done.





[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-08-01 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r130787269
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -216,3 +218,30 @@ private final class SnappyOutputStreamWrapper(os: 
SnappyOutputStream) extends Ou
 }
   }
 }
+
+/**
+ * :: DeveloperApi ::
+ * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]].
--- End diff --

done.





[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-08-01 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r130787205
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -216,3 +218,30 @@ private final class SnappyOutputStreamWrapper(os: 
SnappyOutputStream) extends Ou
 }
   }
 }
+
+/**
+ * :: DeveloperApi ::
+ * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]].
+ *
+ * @note The wire protocol for this codec is not guaranteed to be 
compatible across versions
+ * of Spark. This is intended for use as an internal compression utility 
within a single Spark
+ * application.
+ */
+@DeveloperApi
+class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec {
+
+  override def compressedOutputStream(s: OutputStream): OutputStream = {
+val level = 
conf.getSizeAsBytes("spark.io.compression.zstandard.level", "1").toInt
+val compressionBuffer = 
conf.getSizeAsBytes("spark.io.compression.lz4.blockSize", "32k").toInt
--- End diff --

You are right, we should not share the config with lz4; I created a new one.
Let's keep the default at 32k, which is aligned with the block size used by the
other compression codecs.
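
For illustration, a minimal sketch of the codec with its own buffer-size setting, assuming the zstd-jni library and illustrative config keys (`spark.io.compression.zstd.level`, `spark.io.compression.zstd.bufferSize`); the names actually used in the PR may differ:

```scala
import java.io.{BufferedInputStream, BufferedOutputStream, InputStream, OutputStream}

import com.github.luben.zstd.{ZstdInputStream, ZstdOutputStream}

import org.apache.spark.SparkConf
import org.apache.spark.io.CompressionCodec

class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec {

  // Dedicated buffer size for zstd, defaulting to 32k like the other codecs.
  private def bufferSize: Int =
    conf.getSizeAsBytes("spark.io.compression.zstd.bufferSize", "32k").toInt

  override def compressedOutputStream(s: OutputStream): OutputStream = {
    // Compression level: higher values trade throughput for a better ratio.
    val level = conf.getInt("spark.io.compression.zstd.level", 1)
    new BufferedOutputStream(new ZstdOutputStream(s, level), bufferSize)
  }

  override def compressedInputStream(s: InputStream): InputStream =
    new BufferedInputStream(new ZstdInputStream(s), bufferSize)
}
```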


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18811: [Spark-21604][SQL]Error class name for log, and i...

2017-08-01 Thread zuotingbing
GitHub user zuotingbing opened a pull request:

https://github.com/apache/spark/pull/18811

[Spark-21604][SQL] Wrong class name used for logging; and if the object extends
Logging, I suggest removing the LOG variable, which is redundant.

## What changes were proposed in this pull request?

Fix the wrong class name used for logging; and if the object extends Logging, I suggest
removing the LOG variable, which is redundant.

## How was this patch tested?

Existing tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zuotingbing/spark SPARK-21604

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18811.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18811


commit 3003d6c1c233319d3a2c41d4e8a5823c34b885ad
Author: zuotingbing 
Date:   2017-08-02T05:05:45Z

[SPARK-21604][SQL]Error class name for log

commit 7ea8011eae58467d062e6b5136e4217b567e6551
Author: zuotingbing 
Date:   2017-08-02T05:08:07Z

if the object extends Logging, i suggest to remove the var LOG which is 
useless.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18806: [SPARK-21600] The description of "this requires s...

2017-08-01 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/18806#discussion_r130787079
  
--- Diff: docs/configuration.md ---
@@ -1638,7 +1638,7 @@ Apart from these, the following properties are also 
available, and may be useful
 For more detail, see the description
 here.
 
-This requires spark.shuffle.service.enabled to be set.
+This requires spark.shuffle.service.enabled to be set 
true.
--- End diff --

Since other places clearly define the property, there should be no
ambiguity. Personally I'm not fond of this super-nit fix...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18778: [SPARK-21578][CORE] Add JavaSparkContextSuite

2017-08-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/18778
  
Thank you, @gatorsmile and @srowen !


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18778: [SPARK-21578][CORE] Add JavaSparkContextSuite

2017-08-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18778


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18808: [HOT-FIX][BUILD] Let IntelliJ IDEA correctly detect Lang...

2017-08-01 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18808
  
This may need a JIRA to track it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18778: [SPARK-21578][CORE] Add JavaSparkContextSuite

2017-08-01 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18778
  
Thanks! Merging to master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...

2017-08-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18810
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...

2017-08-01 Thread eatoncys
GitHub user eatoncys opened a pull request:

https://github.com/apache/spark/pull/18810

[SPARK-21603][sql] Whole-stage codegen can be much slower than with whole-stage
codegen disabled when the generated function is too long

## What changes were proposed in this pull request?
Disable whole-stage codegen when the generated function has more lines than the
maximum set by the spark.sql.codegen.MaxFunctionLength parameter, because a
function that is too long will not be JIT-optimized.
A benchmark shows roughly a 10x slowdown when the generated function is too
long:

ignore("max function length of wholestagecodegen") {
val N = 20 << 15

val benchmark = new Benchmark("max function length of 
wholestagecodegen", N)
def f(): Unit = sparkSession.range(N)
  .selectExpr(
"id",
"(id & 1023) as k1",
"cast(id & 1023 as double) as k2",
"cast(id & 1023 as int) as k3",
"case when id > 100 and id <= 200 then 1 else 0 end as v1",
"case when id > 200 and id <= 300 then 1 else 0 end as v2",
"case when id > 300 and id <= 400 then 1 else 0 end as v3",
"case when id > 400 and id <= 500 then 1 else 0 end as v4",
"case when id > 500 and id <= 600 then 1 else 0 end as v5",
"case when id > 600 and id <= 700 then 1 else 0 end as v6",
"case when id > 700 and id <= 800 then 1 else 0 end as v7",
"case when id > 800 and id <= 900 then 1 else 0 end as v8",
"case when id > 900 and id <= 1000 then 1 else 0 end as v9",
"case when id > 1000 and id <= 1100 then 1 else 0 end as v10",
"case when id > 1100 and id <= 1200 then 1 else 0 end as v11",
"case when id > 1200 and id <= 1300 then 1 else 0 end as v12",
"case when id > 1300 and id <= 1400 then 1 else 0 end as v13",
"case when id > 1400 and id <= 1500 then 1 else 0 end as v14",
"case when id > 1500 and id <= 1600 then 1 else 0 end as v15",
"case when id > 1600 and id <= 1700 then 1 else 0 end as v16",
"case when id > 1700 and id <= 1800 then 1 else 0 end as v17",
"case when id > 1800 and id <= 1900 then 1 else 0 end as v18")
  .groupBy("k1", "k2", "k3")
  .sum()
  .collect()

benchmark.addCase(s"codegen = F") { iter =>
  sparkSession.conf.set("spark.sql.codegen.wholeStage", "false")
  f()
}

benchmark.addCase(s"codegen = T") { iter =>
  sparkSession.conf.set("spark.sql.codegen.wholeStage", "true")
  sparkSession.conf.set("spark.sql.codegen.MaxFunctionLength", "1")
  f()
}

benchmark.run()

/*
Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14 on Windows 7 6.1
Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
max function length of wholestagecodegen: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------
codegen = F                                        443 /  507        1.5         676.0       1.0X
codegen = T                                       3279 / 3283        0.2        5002.6       0.1X
 */
  }


## How was this patch tested?
Run the unit test
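
To make the proposal concrete, a rough sketch of the gating logic, assuming a hypothetical helper around the spark.sql.codegen.MaxFunctionLength setting named above (illustrative only, not the final implementation):

```scala
// Illustrative only: fall back to the interpreted (non-codegen) path when the
// generated source has more lines than the configured maximum, since HotSpot
// may refuse to JIT-compile very long methods.
def shouldUseWholeStageCodegen(generatedSource: String, maxFunctionLength: Int): Boolean = {
  val numLines = generatedSource.count(_ == '\n') + 1
  numLines <= maxFunctionLength
}
```

When the check fails, the affected stage would presumably run as if spark.sql.codegen.wholeStage were disabled for that plan.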


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/eatoncys/spark codegen

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18810.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18810


commit ca9eff68424511fa11cc2bd695f1fddaae178e3c
Author: 10129659 
Date:   2017-08-02T03:48:21Z

The wholestage codegen will be slower when the function is too long

commit 1b0ac5ed896136df3579a61d7ef93980c0647e97
Author: 10129659 
Date:   2017-08-02T04:41:24Z

The wholestage codegen will be slower when the function is too long




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18804: [SPARK-21599][SQL] Collecting column statistics f...

2017-08-01 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18804#discussion_r130785462
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -642,8 +642,15 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
   if (stats.get.rowCount.isDefined) {
 statsProperties += STATISTICS_NUM_ROWS -> 
stats.get.rowCount.get.toString()
   }
+
+  // For datasource tables the data schema is stored in the table 
properties.
+  val schema = rawTable.properties.get(DATASOURCE_PROVIDER) match {
+case Some(provider) => getSchemaFromTableProperties(rawTable)
+case _ => rawTable.schema
--- End diff --

Yeah, I saw your comment 
https://github.com/apache/spark/pull/18804#discussion_r130784019 after posting 
https://github.com/apache/spark/pull/18804#discussion_r130784093. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18809: [SPARK-21602][R] Add map_keys and map_values functions t...

2017-08-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18809
  
cc @felixcheung, could you take a look when you have some time?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18809: [SPARK-21602][R] Add map_keys and map_values functions t...

2017-08-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18809
  
**[Test build #80147 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80147/testReport)**
 for PR 18809 at commit 
[`d87f4c4`](https://github.com/apache/spark/commit/d87f4c4a63067aba8d1dc4228fda5d94bd5c830c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18808: [HOT-FIX][BUILD] Let IntelliJ IDEA correctly detect Lang...

2017-08-01 Thread baibaichen
Github user baibaichen commented on the issue:

https://github.com/apache/spark/pull/18808
  
cc @gslowikowski , @srowen 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18780: [INFRA] Close stale PRs

2017-08-01 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18780
  
After we leave polite messages about closing their PRs, I think we should still
keep them open for at least one more week. Although it is trivial for the authors
to reopen them, the feeling is different.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18808: [HOT-FIX][BUILD] Let IntelliJ IDEA correctly detect Lang...

2017-08-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18808
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18804: [SPARK-21599][SQL] Collecting column statistics f...

2017-08-01 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/18804#discussion_r130785255
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -642,8 +642,15 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
   if (stats.get.rowCount.isDefined) {
 statsProperties += STATISTICS_NUM_ROWS -> 
stats.get.rowCount.get.toString()
   }
+
+  // For datasource tables the data schema is stored in the table 
properties.
+  val schema = rawTable.properties.get(DATASOURCE_PROVIDER) match {
+case Some(provider) => getSchemaFromTableProperties(rawTable)
+case _ => rawTable.schema
--- End diff --

@viirya right. I agree. I was saying that we do have a raw table from a 
prior call. So here we just pass that to restoreTableMetadata like you 
suggested.
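
A minimal sketch of the suggested simplification, assuming `restoreTableMetadata` accepts the already-fetched raw table:

```scala
// Reuse the existing restore logic instead of re-deriving the schema here.
// For datasource tables (and Hive serde tables created by Spark 2.1+) this
// reads the schema from the table properties; otherwise it falls back to the
// raw Hive schema.
val schema = restoreTableMetadata(rawTable).schema
```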


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18806: [SPARK-21600] The description of "this requires spark.sh...

2017-08-01 Thread guoxiaolongzte
Github user guoxiaolongzte commented on the issue:

https://github.com/apache/spark/pull/18806
  
@srowen Could you help review the code? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18809: [SPARK-21602][R] Add map_keys and map_values func...

2017-08-01 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/18809

[SPARK-21602][R] Add map_keys and map_values functions to R

## What changes were proposed in this pull request?

This PR adds `map_values` and `map_keys` to R API.

```r
> df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
> tmp <- mutate(df, v = create_map(df$model, df$cyl))
> head(select(tmp, map_keys(tmp$v)))
```
```
map_keys(v)
1 Mazda RX4
2 Mazda RX4 Wag
3Datsun 710
4Hornet 4 Drive
5 Hornet Sportabout
6   Valiant
```
```r
> head(select(tmp, map_values(tmp$v)))
```
```
  map_values(v)
1 6
2 6
3 4
4 6
5 8
6 6
```

## How was this patch tested?

Manual tests and unit tests in `R/pkg/tests/fulltests/test_sparkSQL.R`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark map-keys-values-r

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18809.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18809


commit 75b615a5d0728f14a0219cb1f0576cfcb3e1f73d
Author: hyukjinkwon 
Date:   2017-08-02T04:10:29Z

Add map_keys and map_values functions to R

commit d87f4c4a63067aba8d1dc4228fda5d94bd5c830c
Author: hyukjinkwon 
Date:   2017-08-02T04:39:02Z

Add examples for documentation




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18808: [HOT-FIX][BUILD] Let IntelliJ IDEA correctly dete...

2017-08-01 Thread baibaichen
GitHub user baibaichen opened a pull request:

https://github.com/apache/spark/pull/18808

[HOT-FIX][BUILD] Let IntelliJ IDEA correctly detect Language level and 
Target byte code version

With SPARK-21592, removing the source and target properties from 
maven-compiler-plugin makes IntelliJ IDEA fall back to the default Language level and Target 
bytecode version, which are 1.4.

This change adds the source, target, and encoding properties back to fix this 
issue. In my testing, it doesn't increase compile time.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/baibaichen/spark feature/idea-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18808.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18808


commit 8f250d9263716043b654e09ce0b7f982ca9a0135
Author: Chang chen 
Date:   2017-08-02T04:45:58Z

[HOT-FIX][BUILD] Let IntelliJ IDEA correctly detect Language level and 
Target byte code version

With SPARK-21592, removing source and target properties from 
maven-compiler-plugin lets IntelliJ IDEA use default Language level and Target 
byte code version which are 1.4.

This change adds source, target and encoding properties back to fix this 
issues.  As I test, it doesn't increase compile time.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18806: [SPARK-21600] The description of "this requires s...

2017-08-01 Thread guoxiaolongzte
Github user guoxiaolongzte commented on a diff in the pull request:

https://github.com/apache/spark/pull/18806#discussion_r130784995
  
--- Diff: docs/configuration.md ---
@@ -1638,7 +1638,7 @@ Apart from these, the following properties are also 
available, and may be useful
 For more detail, see the description
 here.
 
-This requires spark.shuffle.service.enabled to be set.
+This requires spark.shuffle.service.enabled to be set 
true.
--- End diff --

You are right, but "usually" is not certain enough. We should not make users guess; 
the documentation needs to describe this accurately. In addition, several other places 
in the Spark documentation clearly state that spark.shuffle.service.enabled must be set to true.

![1](https://user-images.githubusercontent.com/26266482/28858102-e488331a-7780-11e7-90da-9390d1659f35.png)

![2](https://user-images.githubusercontent.com/26266482/28858105-ecfb276e-7780-11e7-9f7f-b0d5448dcb62.png)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18780: [INFRA] Close stale PRs

2017-08-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18780
  
Yes, I just took it out.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18804: [SPARK-21599][SQL] Collecting column statistics f...

2017-08-01 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18804#discussion_r130784093
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -642,8 +642,15 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
   if (stats.get.rowCount.isDefined) {
 statsProperties += STATISTICS_NUM_ROWS -> 
stats.get.rowCount.get.toString()
   }
+
+  // For datasource tables the data schema is stored in the table 
properties.
+  val schema = rawTable.properties.get(DATASOURCE_PROVIDER) match {
+case Some(provider) => getSchemaFromTableProperties(rawTable)
+case _ => rawTable.schema
--- End diff --

You still need `rawTable`. Call `getTable` will incur another metastore 
access.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18804: [SPARK-21599][SQL] Collecting column statistics f...

2017-08-01 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/18804#discussion_r130784019
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -642,8 +642,15 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
   if (stats.get.rowCount.isDefined) {
 statsProperties += STATISTICS_NUM_ROWS -> 
stats.get.rowCount.get.toString()
   }
+
+  // For datasource tables the data schema is stored in the table 
properties.
+  val schema = rawTable.properties.get(DATASOURCE_PROVIDER) match {
+case Some(provider) => getSchemaFromTableProperties(rawTable)
+case _ => rawTable.schema
--- End diff --

@viirya Actually, we do have a raw table here, so I will just call 
restoreTableMetadata. Thanks a lot @gatorsmile and @viirya 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18780: [INFRA] Close stale PRs

2017-08-01 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18780
  
Please take [SPARK-21287] out. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18804: [SPARK-21599][SQL] Collecting column statistics f...

2017-08-01 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/18804#discussion_r130783643
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -642,8 +642,15 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
   if (stats.get.rowCount.isDefined) {
 statsProperties += STATISTICS_NUM_ROWS -> 
stats.get.rowCount.get.toString()
   }
+
+  // For datasource tables the data schema is stored in the table 
properties.
+  val schema = rawTable.properties.get(DATASOURCE_PROVIDER) match {
+case Some(provider) => getSchemaFromTableProperties(rawTable)
+case _ => rawTable.schema
--- End diff --

Should we call getTable().schema, or do you think it is too verbose? i.e.
val schema = getTable(db, table).schema ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18804: [SPARK-21599][SQL] Collecting column statistics f...

2017-08-01 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18804#discussion_r130783264
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -642,8 +642,15 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
   if (stats.get.rowCount.isDefined) {
 statsProperties += STATISTICS_NUM_ROWS -> 
stats.get.rowCount.get.toString()
   }
+
+  // For datasource tables the data schema is stored in the table 
properties.
+  val schema = rawTable.properties.get(DATASOURCE_PROVIDER) match {
+case Some(provider) => getSchemaFromTableProperties(rawTable)
+case _ => rawTable.schema
--- End diff --

Maybe call `restoreTableMetadata` to avoid duplicate logic.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18804: [SPARK-21599][SQL] Collecting column statistics f...

2017-08-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18804#discussion_r130783022
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -642,8 +642,15 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
   if (stats.get.rowCount.isDefined) {
 statsProperties += STATISTICS_NUM_ROWS -> 
stats.get.rowCount.get.toString()
   }
+
+  // For datasource tables the data schema is stored in the table 
properties.
+  val schema = rawTable.properties.get(DATASOURCE_PROVIDER) match {
+case Some(provider) => getSchemaFromTableProperties(rawTable)
+case _ => rawTable.schema
--- End diff --

See the code in 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L755-L758


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18804: [SPARK-21599][SQL] Collecting column statistics f...

2017-08-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18804#discussion_r130783003
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -642,8 +642,15 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
   if (stats.get.rowCount.isDefined) {
 statsProperties += STATISTICS_NUM_ROWS -> 
stats.get.rowCount.get.toString()
   }
+
+  // For datasource tables the data schema is stored in the table 
properties.
+  val schema = rawTable.properties.get(DATASOURCE_PROVIDER) match {
+case Some(provider) => getSchemaFromTableProperties(rawTable)
+case _ => rawTable.schema
--- End diff --

For Hive serde tables that were created by Spark 2.1 or later, we should 
still restore it from table properties. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18779: [SPARK-21580][SQL]Integers in aggregation expressions ar...

2017-08-01 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18779
  
@10110346 Can't we also do the same on order by ordinal?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18804: [SPARK-21599][SQL] Collecting column statistics for data...

2017-08-01 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18804
  
LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18804: [SPARK-21599][SQL] Collecting column statistics f...

2017-08-01 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18804#discussion_r130780780
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -117,6 +117,26 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
 }
   }
 
+  test("analyze non hive compatible datasource tables") {
+val table = "parquet_tab"
+withTable(table) {
+  sql(
+s"""
+  |CREATE TABLE $table (a int, b int)
+  |USING parquet
+  |OPTIONS (skipHiveMetadata true)
+""".stripMargin)
+  sql(s"insert into $table values (1, 1)")
+  sql(s"insert into $table values (2, 1)")
--- End diff --

nit: minor style issue. `INSERT INTO...`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18807: [SPARK-21601][BUILD] Modify the pom.xml file, increase t...

2017-08-01 Thread highfei2011
Github user highfei2011 commented on the issue:

https://github.com/apache/spark/pull/18807
  
OK, thanks, @markhamstra.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18807: [SPARK-21601][BUILD] Modify the pom.xml file, inc...

2017-08-01 Thread highfei2011
Github user highfei2011 closed the pull request at:

https://github.com/apache/spark/pull/18807


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16970: [SPARK-19497][SS]Implement streaming deduplication

2017-08-01 Thread KevinZwx
Github user KevinZwx commented on the issue:

https://github.com/apache/spark/pull/16970
  
I'm a little confused about the behavior of dropDuplicates with a watermark. 
According to my understanding of the guide documentation, with the following code 
I expect deduplication still to be keyed on uuid, with the timestamp column and 
the watermark used only to expire state.

`.withWatermark("timestamp", "1 day")
.dropDuplicates("uuid", "timestamp")`

But in fact I found that the program apparently uses uuid and timestamp as a 
combined key for deduplication, because the result count is much larger than 
with dropDuplicates("uuid") and closer to the result with no deduplication. 
Is this the expected behavior? If so, how can I achieve what I want?
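
To spell out the two variants being compared (a sketch, assuming a hypothetical streaming DataFrame named `events` with `uuid` and `timestamp` columns):

```scala
// Variant 1: deduplicate on uuid only; the watermark is defined but the
// event-time column is not part of the dedup key.
val dedupByUuid = events
  .withWatermark("timestamp", "1 day")
  .dropDuplicates("uuid")

// Variant 2: deduplicate on (uuid, timestamp); records with the same uuid but
// different timestamps count as distinct, which matches the larger result
// count described above.
val dedupByUuidAndTime = events
  .withWatermark("timestamp", "1 day")
  .dropDuplicates("uuid", "timestamp")
```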


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18807: [SPARK-21601][BUILD] Modify the pom.xml file, increase t...

2017-08-01 Thread markhamstra
Github user markhamstra commented on the issue:

https://github.com/apache/spark/pull/18807
  
These are maven-compiler-plugin configurations. We don't use 
maven-compiler-plugin to compile Java code: 
https://github.com/apache/spark/commit/74cda94c5e496e29f42f1044aab90cab7dbe9d38


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

2017-08-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18742
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

2017-08-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18742
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80146/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

2017-08-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18742
  
**[Test build #80146 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80146/testReport)**
 for PR 18742 at commit 
[`470dd7c`](https://github.com/apache/spark/commit/470dd7ccdb9ea5185494b21cb8886e3597ad505e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18779: [SPARK-21580][SQL]Integers in aggregation expressions ar...

2017-08-01 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/18779
  
@viirya Applying it only to `group-by ordinal` is a good idea, I think,
but it would also result in inconsistent handling between `order-by ordinal` and
`group-by ordinal`, and I feel it would be more complicated than the current change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18734: [SPARK-21070][PYSPARK] Attempt to update cloudpickle aga...

2017-08-01 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/18734
  
If we can reach agreement on this I'll see about trying to get our local 
workarounds upstreamed into cloudpickle.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18734: [SPARK-21070][PYSPARK] Attempt to update cloudpickle aga...

2017-08-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18734
  
BTW, I also checked that it passes the tests with Python 3.6 locally.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16774: [SPARK-19357][ML] Adding parallel model evaluation in ML...

2017-08-01 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/16774
  
I'm confused by your suggestions here and in  #18733. 

I don't think it's appropriate to simply "include" similar work that originated 
in another PR and then suggest that the other PR be suspended. 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18804: [SPARK-21599][SQL] Collecting column statistics for data...

2017-08-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18804
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80143/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18804: [SPARK-21599][SQL] Collecting column statistics for data...

2017-08-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18804
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18804: [SPARK-21599][SQL] Collecting column statistics for data...

2017-08-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18804
  
**[Test build #80143 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80143/testReport)**
 for PR 18804 at commit 
[`0afefd5`](https://github.com/apache/spark/commit/0afefd5dde2ddbe03ded3f0e85c21b5bc65040b3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

2017-08-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18742
  
**[Test build #80146 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80146/testReport)**
 for PR 18742 at commit 
[`470dd7c`](https://github.com/apache/spark/commit/470dd7ccdb9ea5185494b21cb8886e3597ad505e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18807: [SPARK-21601][BUILD] Modify the pom.xml file, increase t...

2017-08-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18807
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18807: [SPARK-21601][BUILD] Modify the pom.xml file, inc...

2017-08-01 Thread highfei2011
GitHub user highfei2011 opened a pull request:

https://github.com/apache/spark/pull/18807

[SPARK-21601][BUILD] Modify the pom.xml file, increase the maven compiler 
jdk attribute

## What changes were proposed in this pull request?

When compiling Spark with Maven, I want to add a configurable JDK version property. 
This is more user-friendly.

## How was this patch tested?

mvn test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/highfei2011/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18807.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18807


commit 607eba2dc27768fbb3f604f14412efbf180f9300
Author: jifei_yang02 
Date:   2017-08-02T02:59:22Z

Modify the pom.xml file, increase the maven compiler jdk attribute.

commit 4a22a8c364ffd8c0d10576a564b8ed47af3f60e5
Author: jifei_yang02 
Date:   2017-08-02T03:18:35Z

[SPARK-21601][BUILD] Modify the pom.xml file, increase the maven compiler 
jdk attribute
## What changes were proposed in this pull request?

Modify the pom.xml file,

## How was this patch tested?

mvn test

Author: highfei2011 




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18806: [SPARK-21600] The description of "this requires s...

2017-08-01 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/18806#discussion_r130777347
  
--- Diff: docs/configuration.md ---
@@ -1638,7 +1638,7 @@ Apart from these, the following properties are also 
available, and may be useful
 For more detail, see the description
 here.
 
-This requires spark.shuffle.service.enabled to be set.
+This requires spark.shuffle.service.enabled to be set 
true.
--- End diff --

I think there's no ambiguity here. Usually configuration with name 
"xxx.enabled" can only have two values "true" or "false". So "to be set" 
usually means to enable it (to set it to true).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

2017-08-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18742
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80145/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

2017-08-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18742
  
**[Test build #80145 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80145/testReport)**
 for PR 18742 at commit 
[`ac4cf70`](https://github.com/apache/spark/commit/ac4cf70d7968848638acc080c96f5397275b4655).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

2017-08-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18742
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18803: [SPARK-21597][SS]Fix a potential overflow issue in Event...

2017-08-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18803
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80139/
Test PASSed.





[GitHub] spark issue #18803: [SPARK-21597][SS]Fix a potential overflow issue in Event...

2017-08-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18803
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18803: [SPARK-21597][SS]Fix a potential overflow issue in Event...

2017-08-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18803
  
**[Test build #80139 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80139/testReport)**
 for PR 18803 at commit 
[`7252e2a`](https://github.com/apache/spark/commit/7252e2ab214a1834d27506a4c25333197c3dfc01).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class EventTimeStats(var max: Long, var min: Long, var avg: Double, var count: Long)`





[GitHub] spark pull request #18746: [ML][Python] Implemented UnaryTransformer in Pyth...

2017-08-01 Thread ajaysaini725
Github user ajaysaini725 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18746#discussion_r130775557
  
--- Diff: python/pyspark/ml/base.py ---
@@ -116,3 +121,44 @@ class Model(Transformer):
 """
 
 __metaclass__ = ABCMeta
+
+
+@inherit_doc
+class UnaryTransformer(HasInputCol, HasOutputCol, Transformer):
+
+@abstractmethod
+def createTransformFunc(self):
+"""
+Creates the transoform function using the given param map.
+"""
+raise NotImplementedError()
+
+@abstractmethod
+def outputDataType(self):
+"""
+Returns the data type of the output column as a sql type
+"""
+raise NotImplementedError()
+
+@abstractmethod
+def validateInputType(self, inputType):
+"""
+Validates the input type. Throws an exception if it is invalid.
+"""
+raise NotImplementedError()
+
+def transformSchema(self, schema):
+inputType = schema[self.getInputCol()].dataType
+self.validateInputType(inputType)
+if self.getOutputCol() in schema.names:
+raise ValueError("Output column %s already exists." % 
self.getOutputCol())
+outputFields = copy.copy(schema.fields)
+outputFields.append(StructField(self.getOutputCol(),
+self.outputDataType(),
+nullable=False))
+return StructType(outputFields)
+
+def transform(self, dataset, paramMap=None):
+transformSchema(dataset.schema())
--- End diff --

Right, I accidentally overrode transform instead of _transform. Fixed!
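
For anyone following along, here is a minimal sketch of what the corrected `_transform` override could look like, assuming the `createTransformFunc`, `outputDataType` and `transformSchema` methods from the diff above (an illustration only, not the code that was actually pushed):

```python
from pyspark.sql.functions import udf

def _transform(self, dataset):
    # Validate the input column and derive the output schema first.
    # Note: DataFrame.schema is a property, not a method.
    self.transformSchema(dataset.schema)
    # createTransformFunc returns a plain Python function; wrap it in a UDF here.
    transformUDF = udf(self.createTransformFunc(), self.outputDataType())
    # withColumn returns a new DataFrame, so the result must be returned.
    return dataset.withColumn(self.getOutputCol(),
                              transformUDF(dataset[self.getInputCol()]))
```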





[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...

2017-08-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18742
  
**[Test build #80145 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80145/testReport)**
 for PR 18742 at commit 
[`ac4cf70`](https://github.com/apache/spark/commit/ac4cf70d7968848638acc080c96f5397275b4655).





[GitHub] spark pull request #18746: [ML][Python] Implemented UnaryTransformer in Pyth...

2017-08-01 Thread ajaysaini725
Github user ajaysaini725 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18746#discussion_r130775517
  
--- Diff: python/pyspark/ml/base.py ---
@@ -116,3 +121,44 @@ class Model(Transformer):
 """
 
 __metaclass__ = ABCMeta
+
+
+@inherit_doc
+class UnaryTransformer(HasInputCol, HasOutputCol, Transformer):
+
+@abstractmethod
+def createTransformFunc(self):
+"""
+Creates the transoform function using the given param map.
+"""
+raise NotImplementedError()
+
+@abstractmethod
+def outputDataType(self):
+"""
+Returns the data type of the output column as a sql type
+"""
+raise NotImplementedError()
+
+@abstractmethod
+def validateInputType(self, inputType):
+"""
+Validates the input type. Throws an exception if it is invalid.
+"""
+raise NotImplementedError()
+
+def transformSchema(self, schema):
+inputType = schema[self.getInputCol()].dataType
+self.validateInputType(inputType)
+if self.getOutputCol() in schema.names:
+raise ValueError("Output column %s already exists." % 
self.getOutputCol())
+outputFields = copy.copy(schema.fields)
+outputFields.append(StructField(self.getOutputCol(),
+self.outputDataType(),
+nullable=False))
+return StructType(outputFields)
+
+def transform(self, dataset, paramMap=None):
+transformSchema(dataset.schema())
+transformUDF = udf(self.createTransformFunc(), 
self.outputDataType())
+dataset.withColumn(self.getOutputCol(), 
transformUDF(self.getInputCol()))
--- End diff --

self.createTransformFunc returns a function which is passed to the udf, so in this case I think it is okay.
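
As a hypothetical illustration of that point (the class and transform function below are made up, not part of this PR), a subclass only needs to return a plain Python callable, and the `udf()` wrapping happens once inside `_transform`:

```python
from pyspark.sql.types import IntegerType, StringType

class StringLength(UnaryTransformer):
    """Hypothetical example: maps a string column to its length."""

    def createTransformFunc(self):
        # Plain Python function; it is wrapped in a udf by _transform.
        return lambda s: len(s)

    def outputDataType(self):
        return IntegerType()

    def validateInputType(self, inputType):
        if inputType != StringType():
            raise TypeError("Input column must be StringType, got %s" % inputType)
```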





[GitHub] spark issue #18799: [SPARK-21596][SS]Ensure places calling HDFSMetadataLog.g...

2017-08-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18799
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18799: [SPARK-21596][SS]Ensure places calling HDFSMetadataLog.g...

2017-08-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18799
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80138/
Test PASSed.





[GitHub] spark issue #18799: [SPARK-21596][SS]Ensure places calling HDFSMetadataLog.g...

2017-08-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18799
  
**[Test build #80138 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80138/testReport)**
 for PR 18799 at commit 
[`91efeb3`](https://github.com/apache/spark/commit/91efeb3553e5e9f0f6fb45ae7574231c2e70d845).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #18734: [SPARK-21070][PYSPARK] Attempt to update cloudpic...

2017-08-01 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/18734#discussion_r130774005
  
--- Diff: python/pyspark/cloudpickle.py ---
@@ -220,12 +322,7 @@ def save_function(self, obj, name=None):
 
 if name is None:
 name = obj.__name__
-try:
-# whichmodule() could fail, see
-# 
https://bitbucket.org/gutworth/six/issues/63/importing-six-breaks-pickling
-modname = pickle.whichmodule(obj, name)
-except Exception:
-modname = None
+modname = pickle.whichmodule(obj, name)
--- End diff --

It's a good question; the underlying issue was marked resolved in 2014, and from looking at the commit it seems like it should actually be resolved. That being said, it's true that some people might be on a system with an old installed version of six, so perhaps we should keep this workaround.
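
For context, keeping the workaround would simply mean preserving the guarded lookup that this diff removes, roughly (as in the surrounding `save_function`, `obj` and `name` are already in scope):

```python
import pickle

try:
    # whichmodule() can raise when an installed module (e.g. an old six)
    # breaks module introspection during pickling.
    modname = pickle.whichmodule(obj, name)
except Exception:
    modname = None
```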





[GitHub] spark pull request #18734: [SPARK-21070][PYSPARK] Attempt to update cloudpic...

2017-08-01 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/18734#discussion_r130773575
  
--- Diff: python/pyspark/cloudpickle.py ---
@@ -397,42 +625,7 @@ def save_global(self, obj, name=None, 
pack=struct.pack):
 
 typ = type(obj)
 if typ is not obj and isinstance(obj, (type, types.ClassType)):
-d = dict(obj.__dict__)  # copy dict proxy to a dict
-if not isinstance(d.get('__dict__', None), property):
-# don't extract dict that are properties
-d.pop('__dict__', None)
-d.pop('__weakref__', None)
-
-# hack as __new__ is stored differently in the __dict__
-new_override = d.get('__new__', None)
-if new_override:
-d['__new__'] = obj.__new__
-
-# workaround for namedtuple (hijacked by PySpark)
-if getattr(obj, '_is_namedtuple_', False):
-self.save_reduce(_load_namedtuple, (obj.__name__, 
obj._fields))
-return
-
-self.save(_load_class)
-self.save_reduce(typ, (obj.__name__, obj.__bases__, 
{"__doc__": obj.__doc__}), obj=obj)
-d.pop('__doc__', None)
-# handle property and staticmethod
-dd = {}
-for k, v in d.items():
--- End diff --

Let's double check this part with @davies.





[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-08-01 Thread tejasapatil
Github user tejasapatil commented on the issue:

https://github.com/apache/spark/pull/18805
  
Re: build failure: you can repro that locally by running `./dev/test-dependencies.sh`. It's failing because this introduces a new dependency... you need to add it to `dev/deps/spark-deps-hadoop-XXX`.





[GitHub] spark issue #18806: [SPARK-21600] The description of "this requires spark.sh...

2017-08-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18806
  
Can one of the admins verify this patch?





[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-08-01 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r130769482
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -50,13 +51,14 @@ private[spark] object CompressionCodec {
 
   private[spark] def supportsConcatenationOfSerializedStreams(codec: 
CompressionCodec): Boolean = {
 (codec.isInstanceOf[SnappyCompressionCodec] || 
codec.isInstanceOf[LZFCompressionCodec]
-  || codec.isInstanceOf[LZ4CompressionCodec])
+  || codec.isInstanceOf[LZ4CompressionCodec] || 
codec.isInstanceOf[ZStandardCompressionCodec])
   }
 
   private val shortCompressionCodecNames = Map(
 "lz4" -> classOf[LZ4CompressionCodec].getName,
 "lzf" -> classOf[LZFCompressionCodec].getName,
-"snappy" -> classOf[SnappyCompressionCodec].getName)
+"snappy" -> classOf[SnappyCompressionCodec].getName,
+"zstd" -> classOf[SnappyCompressionCodec].getName)
--- End diff --

you mean `ZStandardCompressionCodec` ?
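
For clarity, the mapping presumably intended here is the existing one with the `zstd` entry pointing at its own class:

```scala
private val shortCompressionCodecNames = Map(
  "lz4" -> classOf[LZ4CompressionCodec].getName,
  "lzf" -> classOf[LZFCompressionCodec].getName,
  "snappy" -> classOf[SnappyCompressionCodec].getName,
  // "zstd" should map to the new codec, not to Snappy
  "zstd" -> classOf[ZStandardCompressionCodec].getName)
```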





[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-08-01 Thread tejasapatil
Github user tejasapatil commented on the issue:

https://github.com/apache/spark/pull/18805
  
In the `Benchmark` section the values for `Lz4` are all zeros, which is confusing to read: at first glance they look like absolute values, but they are supposed to be relative.




