[GitHub] spark pull request #18804: [SPARK-21599][SQL] Collecting column statistics f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18804#discussion_r130798893

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -642,8 +642,13 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       if (stats.get.rowCount.isDefined) {
         statsProperties += STATISTICS_NUM_ROWS -> stats.get.rowCount.get.toString()
       }
+
+      // For datasource tables and hive serde tables created by spark 2.1 or higher,
--- End diff --

Also add a test for Hive serde tables?
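For readers following the thread, the kind of test being asked for might look like the sketch below. This is a minimal sketch, assuming a Hive-enabled test session and the usual SparkFunSuite `test` helper; the table name and assertions are illustrative, not the test that was eventually committed.

```scala
// Hypothetical test: column statistics collected on a Hive serde table
// should be persisted to (and readable back from) the catalog.
test("SPARK-21599: analyze column statistics on a hive serde table") {
  spark.sql("CREATE TABLE hive_serde_tbl (key INT, value STRING) STORED AS TEXTFILE")
  spark.sql("INSERT INTO hive_serde_tbl VALUES (1, 'a'), (2, 'b')")
  spark.sql("ANALYZE TABLE hive_serde_tbl COMPUTE STATISTICS FOR COLUMNS key, value")

  val table = spark.sessionState.catalog.getTableMetadata(
    org.apache.spark.sql.catalyst.TableIdentifier("hive_serde_tbl"))
  assert(table.stats.isDefined)                        // stats were written back
  assert(table.stats.get.rowCount.contains(BigInt(2))) // row count round-trips
}
```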
[GitHub] spark issue #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties from s...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/18668

@vanzin
> the configuration of the execution Hive

Does this mean a Hive client initialized by [HiveUtils.newClientForExecution](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L241)? If so, it is ONLY used in [HiveThriftServer2](https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L88) after the SparkContext is initialized.

Example: `sbin/start-thriftserver.sh --conf spark.hadoop.hive.server2.thrift.port=11001 --hiveconf hive.server2.thrift.port=11000`

Spark Thrift Server will take 11001 as the port: `hive.server2.thrift.port` is first parsed as `11000`, but it is then rewritten to `11001` at [HiveThriftServer2.scala#L90](https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L90) after SparkSQLEnv initializes the SparkContext.

IMO the `spark.hadoop.xxx` properties can be treated as special Spark properties, which should have higher priority than their original form `xxx`. SparkSQLCLIDriver should obey this rule too.
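The precedence rule being argued for here can be sketched as follows. This is a minimal sketch; the helper name and wiring are illustrative, not Spark's actual API.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// Any `spark.hadoop.xxx` Spark property overrides a plain `xxx` entry when
// the Hadoop/Hive configuration is built, so it wins over --hiveconf values.
def applySparkHadoopProperties(sparkConf: SparkConf, hadoopConf: Configuration): Unit = {
  sparkConf.getAll.foreach {
    case (key, value) if key.startsWith("spark.hadoop.") =>
      // e.g. spark.hadoop.hive.server2.thrift.port=11001 beats
      // --hiveconf hive.server2.thrift.port=11000
      hadoopConf.set(key.stripPrefix("spark.hadoop."), value)
    case _ => // non-hadoop Spark properties are left alone
  }
}
```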
[GitHub] spark issue #18812: [SPARK-21606][SQL]HiveThriftServer2 catches OOMs on requ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18812

Can one of the admins verify this patch?
[GitHub] spark pull request #18812: [SPARK-21606][SQL]HiveThriftServer2 catches OOMs ...
GitHub user zuotingbing opened a pull request: https://github.com/apache/spark/pull/18812

[SPARK-21606][SQL] HiveThriftServer2 catches OOMs on request threads

## What changes were proposed in this pull request?
As in Hive, ThriftCLIService methods such as ExecuteStatement can end up catching OOMs, because the underlying errors get wrapped in a RuntimeException by HiveSessionProxy. This PR fixes the same bug in Spark.

## How was this patch tested?
Existing tests and manual tests

You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/zuotingbing/spark OOM_HiveThriftServer2
Alternatively you can review and apply these changes as the patch at:
    https://github.com/apache/spark/pull/18812.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #18812

commit 759330df5685a6f162d9c9666db03b08148b5ba9
Author: zuotingbing
Date: 2017-08-02T06:44:08Z
[SPARK-21606][SQL]HiveThriftServer2 catches OOMs on request threads
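The failure mode being fixed can be illustrated with a small sketch: HiveSessionProxy wraps Throwables thrown by session operations in a RuntimeException, so an OutOfMemoryError raised on a request thread can be swallowed by a generic exception handler. The helper below is an illustration under those assumptions, not the PR's actual code; `runStatement` is a hypothetical stand-in for the proxied ThriftCLIService call.

```scala
def guardAgainstWrappedOOM[T](runStatement: => T): T = {
  try {
    runStatement
  } catch {
    // HiveSessionProxy rethrows invocation-target errors wrapped in a
    // RuntimeException; unwrap and rethrow the OOM so it is not treated
    // as an ordinary failed request.
    case e: RuntimeException if e.getCause.isInstanceOf[OutOfMemoryError] =>
      throw e.getCause
  }
}
```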
[GitHub] spark issue #18808: [SPARK-21605][HOT-FIX][BUILD] Let IntelliJ IDEA correctl...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/18808

Jenkins, test this please
[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Tim...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18664#discussion_r130795748

--- Diff: python/pyspark/sql/tests.py ---
@@ -3036,6 +3052,9 @@ def test_toPandas_arrow_toggle(self):
         pdf = df.toPandas()
         self.spark.conf.set("spark.sql.execution.arrow.enable", "true")
--- End diff --

We do not set it back after converting it to `true`?
[GitHub] spark issue #18808: [SPARK-21605][HOT-FIX][BUILD] Let IntelliJ IDEA correctl...
Github user gslowikowski commented on the issue: https://github.com/apache/spark/pull/18808

I removed too much in #18750. Eclipse uses these two configuration parameters of `maven-compiler-plugin` too.
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130794138

--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -0,0 +1,633 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import java.io._
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
+import org.apache.spark.sql.Column
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Expression, UnsafeArrayData}
+import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Complete, TypedImperativeAggregate}
+import org.apache.spark.sql.catalyst.util.ArrayData
+import org.apache.spark.sql.functions.lit
+import org.apache.spark.sql.types._
+
+/**
+ * A builder object that provides summary statistics about a given column.
+ *
+ * Users should not directly create such builders, but instead use one of the methods in
+ * [[Summarizer]].
+ */
+@Since("2.2.0")
+abstract class SummaryBuilder {
+  /**
+   * Returns an aggregate object that contains the summary of the column with the requested metrics.
+   * @param featuresCol a column that contains features Vector object.
+   * @param weightCol a column that contains weight value.
+   * @return an aggregate column that contains the statistics. The exact content of this
+   *         structure is determined during the creation of the builder.
+   */
+  @Since("2.2.0")
+  def summary(featuresCol: Column, weightCol: Column): Column
+
+  @Since("2.2.0")
+  def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0))
+}
+
+/**
+ * Tools for vectorized statistics on MLlib Vectors.
+ *
+ * The methods in this package provide various statistics for Vectors contained inside DataFrames.
+ *
+ * This class lets users pick the statistics they would like to extract for a given column. Here is
+ * an example in Scala:
+ * {{{
+ *   val dataframe = ... // Some dataframe containing a feature column
+ *   val allStats = dataframe.select(Summarizer.metrics("min", "max").summary($"features"))
+ *   val Row(min_, max_) = allStats.first()
+ * }}}
+ *
+ * If one wants to get a single metric, shortcuts are also available:
+ * {{{
+ *   val meanDF = dataframe.select(Summarizer.mean($"features"))
+ *   val Row(mean_) = meanDF.first()
+ * }}}
+ */
+@Since("2.2.0")
+object Summarizer extends Logging {
+
+  import SummaryBuilderImpl._
+
+  /**
+   * Given a list of metrics, provides a builder that in turn computes metrics from a column.
+   *
+   * See the documentation of [[Summarizer]] for an example.
+   *
+   * The following metrics are accepted (case sensitive):
+   *  - mean: a vector that contains the coefficient-wise mean.
+   *  - variance: a vector that contains the coefficient-wise variance.
+   *  - count: the count of all vectors seen.
+   *  - numNonzeros: a vector with the number of non-zeros for each coefficient.
+   *  - max: the maximum for each coefficient.
+   *  - min: the minimum for each coefficient.
+   *  - normL2: the Euclidean norm for each coefficient.
+   *  - normL1: the L1 norm of each coefficient (sum of the absolute values).
+   * @param firstMetric the metric being provided
+   * @param metrics additional metrics that can be provided.
+   * @return a builder.
+   * @throws IllegalArgumentException if one of the metric names is not understood.
+   */
+  @Since("2.2.0")
+  def metrics(firstMetric: String, metrics: String*): SummaryBuilder = {
+    val (typedMetrics, computeMetrics) = getRelevantMetrics(Seq(firstMetric) ++ metrics)
+    new SummaryBuilderImpl(typedMetrics, computeMetrics)
+  }
+
+  def mean(col: Co
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130794916

--- Diff: mllib/src/test/scala/org/apache/spark/ml/stat/SummarizerSuite.scala ---
@@ -0,0 +1,619 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import org.scalatest.exceptions.TestFailedException
+
+import org.apache.spark.{SparkException, SparkFunSuite}
+import org.apache.spark.ml.linalg.{Vector, Vectors}
+import org.apache.spark.ml.util.TestingUtils._
+import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}
+import org.apache.spark.mllib.stat.{MultivariateOnlineSummarizer, Statistics}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
+
+class SummarizerSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  import testImplicits._
+  import Summarizer._
+  import SummaryBuilderImpl._
+
+  private case class ExpectedMetrics(
+      mean: Seq[Double],
+      variance: Seq[Double],
+      count: Long,
+      numNonZeros: Seq[Long],
+      max: Seq[Double],
+      min: Seq[Double],
+      normL2: Seq[Double],
+      normL1: Seq[Double])
+
+  // The input is expected to be either a sparse vector, a dense vector or an array of doubles
+  // (which will be converted to a dense vector).
+  // The expected value is the list of all the known metrics.
+  //
+  // The tests take a list of input vectors and a list of all the summary values that
+  // are expected for this input. They currently test against some fixed subset of the
+  // metrics, but should be made fuzzy in the future.
+
+  private def testExample(name: String, input: Seq[Any], exp: ExpectedMetrics): Unit = {
+    def inputVec: Seq[Vector] = input.map {
+      case x: Array[Double @unchecked] => Vectors.dense(x)
+      case x: Seq[Double @unchecked] => Vectors.dense(x.toArray)
+      case x: Vector => x
+      case x => throw new Exception(x.toString)
+    }
+
+    val s = {
+      val s2 = new MultivariateOnlineSummarizer
+      inputVec.foreach(v => s2.add(OldVectors.fromML(v)))
+      s2
+    }
+
+    // Because the Spark context is reset between tests, we cannot hold a reference onto it.
+    def wrapped() = {
+      val df = sc.parallelize(inputVec).map(Tuple1.apply).toDF("features")
+      val c = df.col("features")
+      (df, c)
+    }
+
+    registerTest(s"$name - mean only") {
+      val (df, c) = wrapped()
+      compare(df.select(metrics("mean").summary(c), mean(c)), Seq(Row(exp.mean), s.mean))
+    }
+
+    registerTest(s"$name - mean only (direct)") {
+      val (df, c) = wrapped()
+      compare(df.select(mean(c)), Seq(exp.mean))
+    }
+
+    registerTest(s"$name - variance only") {
+      val (df, c) = wrapped()
+      compare(df.select(metrics("variance").summary(c), variance(c)),
+        Seq(Row(exp.variance), s.variance))
+    }
+
+    registerTest(s"$name - variance only (direct)") {
+      val (df, c) = wrapped()
+      compare(df.select(variance(c)), Seq(s.variance))
+    }
+
+    registerTest(s"$name - count only") {
+      val (df, c) = wrapped()
+      compare(df.select(metrics("count").summary(c), count(c)),
+        Seq(Row(exp.count), exp.count))
+    }
+
+    registerTest(s"$name - count only (direct)") {
+      val (df, c) = wrapped()
+      compare(df.select(count(c)),
+        Seq(exp.count))
+    }
+
+    registerTest(s"$name - numNonZeros only") {
+      val (df, c) = wrapped()
+      compare(df.select(metrics("numNonZeros").summary(c), numNonZeros(c)),
+        Seq(Row(exp.numNonZeros), exp.numNonZeros))
+    }
+
+    registerTest(s"$name - numNonZeros only (direct)") {
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130794375

--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
(quotes the same Summarizer.scala context as discussion_r130794138 above)
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130792887

--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
(quotes the same Summarizer.scala context as discussion_r130794138 above)
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130793985

--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
(quotes the same Summarizer.scala context as discussion_r130794138 above)
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r130793859

--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
(quotes the same Summarizer.scala context as discussion_r130794138 above)
[GitHub] spark pull request #18619: [SPARK-21397][BUILD]Maven shade plugin adding dep...
Github user zuotingbing closed the pull request at: https://github.com/apache/spark/pull/18619
[GitHub] spark issue #18106: [SPARK-20754][SQL] Support TRUNC (number)
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18106

**[Test build #80150 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80150/testReport)** for PR 18106 at commit [`3d40c36`](https://github.com/apache/spark/commit/3d40c366892303cd0de8259b31aebe7a748d89e6).
[GitHub] spark issue #18695: [SPARK-12717][PYTHON] Adding thread-safe broadcast pickl...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/18695

@HyukjinKwon that needs to be added separately by someone who has access to Jenkins as admin
[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Tim...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18664#discussion_r13079

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala ---
@@ -42,6 +43,9 @@ object ArrowUtils {
     case StringType => ArrowType.Utf8.INSTANCE
     case BinaryType => ArrowType.Binary.INSTANCE
     case DecimalType.Fixed(precision, scale) => new ArrowType.Decimal(precision, scale)
+    case DateType => new ArrowType.Date(DateUnit.DAY)
+    case TimestampType =>
+      new ArrowType.Timestamp(TimeUnit.MICROSECOND, DateTimeUtils.defaultTimeZone().getID)
--- End diff --

This is wrong, right?
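The concern here is that pinning Arrow timestamps to the JVM's default timezone bakes driver-local state into the schema. A sketch of the alternative, threading a caller-supplied zone through, follows; the `timeZoneId` parameter is an assumption for illustration, not necessarily the signature Spark adopted.

```scala
import org.apache.arrow.vector.types.{DateUnit, TimeUnit}
import org.apache.arrow.vector.types.pojo.ArrowType
import org.apache.spark.sql.types._

// Map a Spark SQL type to an Arrow type, taking the timezone from the
// caller (e.g. spark.sql.session.timeZone) instead of the JVM default.
def toArrowType(dt: DataType, timeZoneId: String): ArrowType = dt match {
  case DateType => new ArrowType.Date(DateUnit.DAY)
  case TimestampType => new ArrowType.Timestamp(TimeUnit.MICROSECOND, timeZoneId)
  case other => throw new UnsupportedOperationException(s"Unsupported type: $other")
}
```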
[GitHub] spark issue #18804: [SPARK-21599][SQL] Collecting column statistics for data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18804

**[Test build #80149 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80149/testReport)** for PR 18804 at commit [`420be2f`](https://github.com/apache/spark/commit/420be2f28db5f413566c161aa7969db664cd8f3b).
[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...
Github user yaooqinn commented on a diff in the pull request: https://github.com/apache/spark/pull/18668#discussion_r130792745

--- Diff: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala ---
@@ -283,4 +283,17 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with Logging {
       "SET conf3;" -> "conftest"
     )
   }
+
+  test("SPARK-21451: spark.sql.warehouse.dir should respect options in --hiveconf") {
+    runCliWithin(1.minute)("set spark.sql.warehouse.dir;" -> warehousePath.getAbsolutePath)
+  }
+
+  test("SPARK-21451: Apply spark.hadoop.* configurations") {
--- End diff --

Yes. After the SparkContext is initialized, `spark.hadoop.hive.metastore.warehouse.dir` is translated into the Hadoop conf `hive.metastore.warehouse.dir`, which serves as an alternative source of the warehouse dir, so this test case couldn't tell whether this PR works: CliSuite cannot see these values unless we explicitly set them in SQLConf. The original code did break another test case anyway.
[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Tim...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18664#discussion_r130792754

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -3092,7 +3092,8 @@ class Dataset[T] private[sql](
     val maxRecordsPerBatch = sparkSession.sessionState.conf.arrowMaxRecordsPerBatch
     queryExecution.toRdd.mapPartitionsInternal { iter =>
       val context = TaskContext.get()
-      ArrowConverters.toPayloadIterator(iter, schemaCaptured, maxRecordsPerBatch, context)
+      ArrowConverters.toPayloadIterator(
+        iter, schemaCaptured, maxRecordsPerBatch, context)
--- End diff --

Revert this change?
[GitHub] spark issue #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties from s...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/18668

There is a bug in HiveClientImpl's reuse of CliSessionState; see [HiveClientImpl.scala#L140](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L140):

>> // In `SparkSQLCLIDriver`, we have already started a `CliSessionState`,
// which contains information like configurations from command line. Later
// we call `SparkSQLEnv.init()` there, which would run into this part again.
// so we should keep `conf` and reuse the existing instance of `CliSessionState`

Actually, that branch is never reached and the session is never reused: `session.SessionState` is re-generated every time `HiveClient.newSession()` is called. If you simply run `bin/spark-sql --master local` with INFO logging on, you can see [HiveClientImpl.scala#L193](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L193) called four times, each time creating session-related directories:

1. [HiveExternalCatalog.scala#L65](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L65)
2. [HiveSessionStateBuilder.scala#L45](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala#L45)
3. [SparkSQLEnv.scala#L54](https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala#L54) — which is unnecessary, I guess
4. [SparkSQLCLIDriver.scala#L115](https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala#L115) — which should be reused, and which is entitled to see all Hadoop configurations
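The reuse the quoted comment intends could be sketched like this — illustrative only, assuming Hive's SessionState API; it is not the code Spark ended up with.

```scala
import org.apache.hadoop.hive.cli.CliSessionState
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.session.SessionState

// Adopt an already-started CliSessionState (which carries --hiveconf options
// and the derived Hadoop configuration) instead of building a fresh
// SessionState on every newSession() call.
def obtainSessionState(hiveConf: HiveConf): SessionState =
  SessionState.get() match {
    case cli: CliSessionState => cli // reuse the CLI session
    case _ =>
      val state = new SessionState(hiveConf)
      SessionState.start(state) // creates the session-scoped scratch dirs
      state
  }
```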
[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec
Github user rxin commented on the issue: https://github.com/apache/spark/pull/18805

How big is the dependency that's getting pulled in? If we are adding more compression codecs maybe we should retire some old ones, or move them into a separate package so downstream apps can optionally depend on them.
[GitHub] spark issue #18809: [SPARK-21602][R] Add map_keys and map_values functions t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18809

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80147/
[GitHub] spark issue #18809: [SPARK-21602][R] Add map_keys and map_values functions t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18809

Merged build finished. Test PASSed.
[GitHub] spark issue #18809: [SPARK-21602][R] Add map_keys and map_values functions t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18809

**[Test build #80147 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80147/testReport)** for PR 18809 at commit [`d87f4c4`](https://github.com/apache/spark/commit/d87f4c4a63067aba8d1dc4228fda5d94bd5c830c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #18806: [SPARK-21600] The description of "this requires s...
Github user guoxiaolongzte commented on a diff in the pull request: https://github.com/apache/spark/pull/18806#discussion_r130788660

--- Diff: docs/configuration.md ---
@@ -1638,7 +1638,7 @@ Apart from these, the following properties are also available, and may be useful
     For more detail, see the description here.
-    This requires spark.shuffle.service.enabled to be set.
+    This requires spark.shuffle.service.enabled to be set true.
--- End diff --

Thank you for your comments. "This requires spark.shuffle.service.enabled to be set true" is very clear. Only with such an accurate description is there no ambiguity.
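For context, the setting under discussion looks like this in practice — a minimal sketch using the real config keys:

```scala
import org.apache.spark.SparkConf

// Dynamic allocation requires the external shuffle service to be explicitly
// enabled, i.e. set to true rather than merely "set".
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
```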
[GitHub] spark issue #18808: [SPARK-21605][HOT-FIX][BUILD] Let IntelliJ IDEA correctl...
Github user baibaichen commented on the issue: https://github.com/apache/spark/pull/18808

https://issues.apache.org/jira/browse/SPARK-21605 has been added
[GitHub] spark issue #18555: [SPARK-21353][CORE]add checkValue in spark.internal.conf...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18555

Thanks! @heary-cao cc @jiangxb1987 Could you take a look to ensure no behavior change will be caused by this PR?
[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18805

**[Test build #80148 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80148/testReport)** for PR 18805 at commit [`295f38a`](https://github.com/apache/spark/commit/295f38a808dfdbbba94a83a21708b0597327d195).
[GitHub] spark issue #18811: [SPARK-21604][SQL]Error class name for log, and if the o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18811

Can one of the admins verify this patch?
[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec
Github user sitalkedia commented on a diff in the pull request: https://github.com/apache/spark/pull/18805#discussion_r130787262

--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -50,13 +51,14 @@ private[spark] object CompressionCodec {
   private[spark] def supportsConcatenationOfSerializedStreams(codec: CompressionCodec): Boolean = {
     (codec.isInstanceOf[SnappyCompressionCodec] || codec.isInstanceOf[LZFCompressionCodec]
-      || codec.isInstanceOf[LZ4CompressionCodec])
+      || codec.isInstanceOf[LZ4CompressionCodec] || codec.isInstanceOf[ZStandardCompressionCodec])
   }

   private val shortCompressionCodecNames = Map(
     "lz4" -> classOf[LZ4CompressionCodec].getName,
     "lzf" -> classOf[LZFCompressionCodec].getName,
-    "snappy" -> classOf[SnappyCompressionCodec].getName)
+    "snappy" -> classOf[SnappyCompressionCodec].getName,
+    "zstd" -> classOf[SnappyCompressionCodec].getName)
--- End diff --

Ah, my bad. Fixed it.
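The fix the author refers to is simply pointing "zstd" at the new codec class; a sketch of the corrected map (as it would appear inside `object CompressionCodec`):

```scala
// Corrected: "zstd" had accidentally been mapped to SnappyCompressionCodec
// in the diff above.
private val shortCompressionCodecNames = Map(
  "lz4" -> classOf[LZ4CompressionCodec].getName,
  "lzf" -> classOf[LZFCompressionCodec].getName,
  "snappy" -> classOf[SnappyCompressionCodec].getName,
  "zstd" -> classOf[ZStandardCompressionCodec].getName)
```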
[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec
Github user sitalkedia commented on a diff in the pull request: https://github.com/apache/spark/pull/18805#discussion_r130787287

--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -216,3 +218,30 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
     }
   }
 }
+
+/**
+ * :: DeveloperApi ::
+ * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]].
+ *
+ * @note The wire protocol for this codec is not guaranteed to be compatible across versions
+ * of Spark. This is intended for use as an internal compression utility within a single Spark
+ * application.
+ */
+@DeveloperApi
+class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec {
+
+  override def compressedOutputStream(s: OutputStream): OutputStream = {
+    val level = conf.getSizeAsBytes("spark.io.compression.zstandard.level", "1").toInt
--- End diff --

Done.
[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec
Github user sitalkedia commented on a diff in the pull request: https://github.com/apache/spark/pull/18805#discussion_r130787269

--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -216,3 +218,30 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
     }
   }
 }
+
+/**
+ * :: DeveloperApi ::
+ * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]].
--- End diff --

Done.
[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec
Github user sitalkedia commented on a diff in the pull request: https://github.com/apache/spark/pull/18805#discussion_r130787205

--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -216,3 +218,30 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
     }
   }
 }
+
+/**
+ * :: DeveloperApi ::
+ * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]].
+ *
+ * @note The wire protocol for this codec is not guaranteed to be compatible across versions
+ * of Spark. This is intended for use as an internal compression utility within a single Spark
+ * application.
+ */
+@DeveloperApi
+class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec {
+
+  override def compressedOutputStream(s: OutputStream): OutputStream = {
+    val level = conf.getSizeAsBytes("spark.io.compression.zstandard.level", "1").toInt
+    val compressionBuffer = conf.getSizeAsBytes("spark.io.compression.lz4.blockSize", "32k").toInt
--- End diff --

You are right, we should not share the config with lz4; I created a new one. Let's keep the default at 32 KB, which is aligned with the block size used by the other compression codecs.
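Putting the two review comments together, the revised codec might look like the sketch below. It assumes the zstd-jni library's `ZstdOutputStream`/`ZstdInputStream`; the config key names follow the thread and are not guaranteed to match the keys finally merged.

```scala
import java.io.{BufferedInputStream, BufferedOutputStream, InputStream, OutputStream}

import com.github.luben.zstd.{ZstdInputStream, ZstdOutputStream}

import org.apache.spark.SparkConf
import org.apache.spark.io.CompressionCodec

class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec {

  // Dedicated keys: a compression level, and a buffer size no longer shared
  // with the lz4 block-size setting.
  private val level = conf.getInt("spark.io.compression.zstd.level", 1)
  private val bufferSize =
    conf.getSizeAsBytes("spark.io.compression.zstd.bufferSize", "32k").toInt

  override def compressedOutputStream(s: OutputStream): OutputStream =
    new BufferedOutputStream(new ZstdOutputStream(s, level), bufferSize)

  override def compressedInputStream(s: InputStream): InputStream =
    new BufferedInputStream(new ZstdInputStream(s), bufferSize)
}
```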
[GitHub] spark pull request #18811: [SPARK-21604][SQL]Error class name for log, and i...
GitHub user zuotingbing opened a pull request: https://github.com/apache/spark/pull/18811

[SPARK-21604][SQL] Error class name for log; if the object extends Logging, remove the useless LOG variable

## What changes were proposed in this pull request?
The wrong class name is used for the logger; also, when an object already extends Logging, I suggest removing the LOG variable, which is useless.

## How was this patch tested?
Existing tests

You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/zuotingbing/spark SPARK-21604
Alternatively you can review and apply these changes as the patch at:
    https://github.com/apache/spark/pull/18811.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #18811

commit 3003d6c1c233319d3a2c41d4e8a5823c34b885ad
Author: zuotingbing
Date: 2017-08-02T05:05:45Z
[SPARK-21604][SQL]Error class name for log

commit 7ea8011eae58467d062e6b5136e4217b567e6551
Author: zuotingbing
Date: 2017-08-02T05:08:07Z
if the object extends Logging, i suggest to remove the var LOG which is useless.
[GitHub] spark pull request #18806: [SPARK-21600] The description of "this requires s...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/18806#discussion_r130787079

--- Diff: docs/configuration.md ---
@@ -1638,7 +1638,7 @@ Apart from these, the following properties are also available, and may be useful
     For more detail, see the description here.
-    This requires spark.shuffle.service.enabled to be set.
+    This requires spark.shuffle.service.enabled to be set true.
--- End diff --

Since the other places clearly define the property, there should be no ambiguity. Personally I'm not fond of this super-nit fix...
[GitHub] spark issue #18778: [SPARK-21578][CORE] Add JavaSparkContextSuite
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/18778

Thank you, @gatorsmile and @srowen !
[GitHub] spark pull request #18778: [SPARK-21578][CORE] Add JavaSparkContextSuite
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18778
[GitHub] spark issue #18808: [HOT-FIX][BUILD] Let IntelliJ IDEA correctly detect Lang...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18808

This may need a JIRA to track it.
[GitHub] spark issue #18778: [SPARK-21578][CORE] Add JavaSparkContextSuite
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18778

Thanks! Merging to master.
[GitHub] spark issue #18810: [SPARK-21603][sql]The wholestage codegen will be much sl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18810

Can one of the admins verify this patch?
[GitHub] spark pull request #18810: [SPARK-21603][sql]The wholestage codegen will be ...
GitHub user eatoncys opened a pull request: https://github.com/apache/spark/pull/18810

[SPARK-21603][SQL] Whole-stage codegen will be much slower than with whole-stage codegen disabled when the generated function is too long

## What changes were proposed in this pull request?

Disable whole-stage codegen when the generated function has more lines than the maximum set by the spark.sql.codegen.MaxFunctionLength parameter, because when the function is too long it will not get JIT optimization. A benchmark shows a 10x slowdown when the generated function is too long:

ignore("max function length of wholestagecodegen") {
  val N = 20 << 15
  val benchmark = new Benchmark("max function length of wholestagecodegen", N)
  def f(): Unit = sparkSession.range(N)
    .selectExpr(
      "id",
      "(id & 1023) as k1",
      "cast(id & 1023 as double) as k2",
      "cast(id & 1023 as int) as k3",
      "case when id > 100 and id <= 200 then 1 else 0 end as v1",
      "case when id > 200 and id <= 300 then 1 else 0 end as v2",
      "case when id > 300 and id <= 400 then 1 else 0 end as v3",
      "case when id > 400 and id <= 500 then 1 else 0 end as v4",
      "case when id > 500 and id <= 600 then 1 else 0 end as v5",
      "case when id > 600 and id <= 700 then 1 else 0 end as v6",
      "case when id > 700 and id <= 800 then 1 else 0 end as v7",
      "case when id > 800 and id <= 900 then 1 else 0 end as v8",
      "case when id > 900 and id <= 1000 then 1 else 0 end as v9",
      "case when id > 1000 and id <= 1100 then 1 else 0 end as v10",
      "case when id > 1100 and id <= 1200 then 1 else 0 end as v11",
      "case when id > 1200 and id <= 1300 then 1 else 0 end as v12",
      "case when id > 1300 and id <= 1400 then 1 else 0 end as v13",
      "case when id > 1400 and id <= 1500 then 1 else 0 end as v14",
      "case when id > 1500 and id <= 1600 then 1 else 0 end as v15",
      "case when id > 1600 and id <= 1700 then 1 else 0 end as v16",
      "case when id > 1700 and id <= 1800 then 1 else 0 end as v17",
      "case when id > 1800 and id <= 1900 then 1 else 0 end as v18")
    .groupBy("k1", "k2", "k3")
    .sum()
    .collect()

  benchmark.addCase(s"codegen = F") { iter =>
    sparkSession.conf.set("spark.sql.codegen.wholeStage", "false")
    f()
  }

  benchmark.addCase(s"codegen = T") { iter =>
    sparkSession.conf.set("spark.sql.codegen.wholeStage", "true")
    sparkSession.conf.set("spark.sql.codegen.MaxFunctionLength", "1")
    f()
  }

  benchmark.run()

  /*
  Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14 on Windows 7 6.1
  Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
  max function length of wholestagecodegen: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
  codegen = F                                      443 /  507          1.5         676.0       1.0X
  codegen = T                                     3279 / 3283          0.2        5002.6       0.1X
  */
}

## How was this patch tested?

Run the unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/eatoncys/spark codegen

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18810.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18810

commit ca9eff68424511fa11cc2bd695f1fddaae178e3c
Author: 10129659
Date: 2017-08-02T03:48:21Z

    The wholestage codegen will be slower when the function is too long

commit 1b0ac5ed896136df3579a61d7ef93980c0647e97
Author: 10129659
Date: 2017-08-02T04:41:24Z

    The wholestage codegen will be slower when the function is too long
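A rough sketch of the guard this PR proposes. The config name comes from the description above; the helper, the line-based measure, and the threshold plumbing are illustrative assumptions:

```scala
// Illustrative only: skip whole-stage codegen when the generated function is
// too long, since HotSpot's JIT by default refuses to compile methods larger
// than 8000 bytecode bytes (see -XX:-DontCompileHugeMethods).
object CodegenLengthGuard {
  def withinMaxFunctionLength(generatedCode: String, maxLines: Int): Boolean =
    generatedCode.split("\n").length <= maxLines
}

// Hypothetical call site at planning time:
// val useWholeStage = CodegenLengthGuard.withinMaxFunctionLength(code, maxFunctionLength)
```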
[GitHub] spark pull request #18804: [SPARK-21599][SQL] Collecting column statistics f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18804#discussion_r130785462

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -642,8 +642,15 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     if (stats.get.rowCount.isDefined) {
       statsProperties += STATISTICS_NUM_ROWS -> stats.get.rowCount.get.toString()
     }
+
+    // For datasource tables the data schema is stored in the table properties.
+    val schema = rawTable.properties.get(DATASOURCE_PROVIDER) match {
+      case Some(provider) => getSchemaFromTableProperties(rawTable)
+      case _ => rawTable.schema
--- End diff --

Yeah, I saw your comment https://github.com/apache/spark/pull/18804#discussion_r130784019 after posting https://github.com/apache/spark/pull/18804#discussion_r130784093. :)
[GitHub] spark issue #18809: [SPARK-21602][R] Add map_keys and map_values functions t...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18809

cc @felixcheung, could you take a look when you have some time?
[GitHub] spark issue #18809: [SPARK-21602][R] Add map_keys and map_values functions t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18809

**[Test build #80147 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80147/testReport)** for PR 18809 at commit [`d87f4c4`](https://github.com/apache/spark/commit/d87f4c4a63067aba8d1dc4228fda5d94bd5c830c).
[GitHub] spark issue #18808: [HOT-FIX][BUILD] Let IntelliJ IDEA correctly detect Lang...
Github user baibaichen commented on the issue: https://github.com/apache/spark/pull/18808

cc @gslowikowski , @srowen
[GitHub] spark issue #18780: [INFRA] Close stale PRs
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18780

After we leave polite messages asking to close their PRs, I think we should still keep them open for at least one more week. Although it is trivial for authors to reopen them, the feelings are different.
[GitHub] spark issue #18808: [HOT-FIX][BUILD] Let IntelliJ IDEA correctly detect Lang...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18808

Can one of the admins verify this patch?
[GitHub] spark pull request #18804: [SPARK-21599][SQL] Collecting column statistics f...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/18804#discussion_r130785255

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -642,8 +642,15 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     if (stats.get.rowCount.isDefined) {
       statsProperties += STATISTICS_NUM_ROWS -> stats.get.rowCount.get.toString()
     }
+
+    // For datasource tables the data schema is stored in the table properties.
+    val schema = rawTable.properties.get(DATASOURCE_PROVIDER) match {
+      case Some(provider) => getSchemaFromTableProperties(rawTable)
+      case _ => rawTable.schema
--- End diff --

@viirya right. I agree. I was saying that we do have a raw table from a prior call. So here we just pass that to restoreTableMetadata like you suggested.
[GitHub] spark issue #18806: [SPARK-21600] The description of "this requires spark.sh...
Github user guoxiaolongzte commented on the issue: https://github.com/apache/spark/pull/18806

@srowen Could you please help review the code? Thanks.
[GitHub] spark pull request #18809: [SPARK-21602][R] Add map_keys and map_values func...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/18809

[SPARK-21602][R] Add map_keys and map_values functions to R

## What changes were proposed in this pull request?

This PR adds `map_values` and `map_keys` to the R API.

```r
> df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
> tmp <- mutate(df, v = create_map(df$model, df$cyl))
> head(select(tmp, map_keys(tmp$v)))
```

```
        map_keys(v)
1         Mazda RX4
2     Mazda RX4 Wag
3        Datsun 710
4    Hornet 4 Drive
5 Hornet Sportabout
6           Valiant
```

```r
> head(select(tmp, map_values(tmp$v)))
```

```
  map_values(v)
1             6
2             6
3             4
4             6
5             8
6             6
```

## How was this patch tested?

Manual tests and unit tests in `R/pkg/tests/fulltests/test_sparkSQL.R`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark map-keys-values-r

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18809.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18809

commit 75b615a5d0728f14a0219cb1f0576cfcb3e1f73d
Author: hyukjinkwon
Date: 2017-08-02T04:10:29Z

    Add map_keys and map_values functions to R

commit d87f4c4a63067aba8d1dc4228fda5d94bd5c830c
Author: hyukjinkwon
Date: 2017-08-02T04:39:02Z

    Add examples for documentation
[GitHub] spark pull request #18808: [HOT-FIX][BUILD] Let IntelliJ IDEA correctly dete...
GitHub user baibaichen opened a pull request: https://github.com/apache/spark/pull/18808

[HOT-FIX][BUILD] Let IntelliJ IDEA correctly detect Language level and Target byte code version

With SPARK-21592, removing the source and target properties from maven-compiler-plugin makes IntelliJ IDEA fall back to its default language level and target bytecode version, which are 1.4. This change adds the source, target and encoding properties back to fix the issue. In my testing, it doesn't increase compile time.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/baibaichen/spark feature/idea-fix

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18808.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18808

commit 8f250d9263716043b654e09ce0b7f982ca9a0135
Author: Chang chen
Date: 2017-08-02T04:45:58Z

    [HOT-FIX][BUILD] Let IntelliJ IDEA correctly detect Language level and Target byte code version

    With SPARK-21592, removing source and target properties from maven-compiler-plugin lets IntelliJ IDEA use default Language level and Target byte code version which are 1.4. This change adds source, target and encoding properties back to fix this issues. As I test, it doesn't increase compile time.
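For context, the kind of pom.xml fragment the description refers to, as a hedged sketch (values are illustrative; Spark's actual pom wires these through Maven properties). Even though scala-maven-plugin does the compilation, as noted elsewhere in this thread, IDEA reads maven-compiler-plugin's settings to pick the project's language level:

```xml
<!-- Sketch of the restored settings; exact placement and values may differ. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <configuration>
    <source>${java.version}</source>
    <target>${java.version}</target>
    <encoding>UTF-8</encoding>
  </configuration>
</plugin>
```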
[GitHub] spark pull request #18806: [SPARK-21600] The description of "this requires s...
Github user guoxiaolongzte commented on a diff in the pull request: https://github.com/apache/spark/pull/18806#discussion_r130784995

--- Diff: docs/configuration.md ---
@@ -1638,7 +1638,7 @@ Apart from these, the following properties are also available, and may be useful
     For more detail, see the description here.
-    This requires spark.shuffle.service.enabled to be set.
+    This requires spark.shuffle.service.enabled to be set true.
--- End diff --

You are right, but "usually" is not very certain. We should not make users guess; the documentation needs to describe this accurately. In addition, several other places in the Spark docs clearly state that spark.shuffle.service.enabled must be set to true.

![1](https://user-images.githubusercontent.com/26266482/28858102-e488331a-7780-11e7-90da-9390d1659f35.png)
![2](https://user-images.githubusercontent.com/26266482/28858105-ecfb276e-7780-11e7-9f7f-b0d5448dcb62.png)
[GitHub] spark issue #18780: [INFRA] Close stale PRs
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18780

Yes, I just took it out.
[GitHub] spark pull request #18804: [SPARK-21599][SQL] Collecting column statistics f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18804#discussion_r130784093

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -642,8 +642,15 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     if (stats.get.rowCount.isDefined) {
       statsProperties += STATISTICS_NUM_ROWS -> stats.get.rowCount.get.toString()
     }
+
+    // For datasource tables the data schema is stored in the table properties.
+    val schema = rawTable.properties.get(DATASOURCE_PROVIDER) match {
+      case Some(provider) => getSchemaFromTableProperties(rawTable)
+      case _ => rawTable.schema
--- End diff --

You still need `rawTable`. Calling `getTable` would incur another metastore access.
[GitHub] spark pull request #18804: [SPARK-21599][SQL] Collecting column statistics f...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/18804#discussion_r130784019

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -642,8 +642,15 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     if (stats.get.rowCount.isDefined) {
       statsProperties += STATISTICS_NUM_ROWS -> stats.get.rowCount.get.toString()
     }
+
+    // For datasource tables the data schema is stored in the table properties.
+    val schema = rawTable.properties.get(DATASOURCE_PROVIDER) match {
+      case Some(provider) => getSchemaFromTableProperties(rawTable)
+      case _ => rawTable.schema
--- End diff --

@viirya Actually, we do have a raw table here, so I will just call restoreTableMetadata. Thanks a lot @gatorsmile and @viirya
[GitHub] spark issue #18780: [INFRA] Close stale PRs
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18780

Please take [SPARK-21287] out.
[GitHub] spark pull request #18804: [SPARK-21599][SQL] Collecting column statistics f...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/18804#discussion_r130783643

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -642,8 +642,15 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     if (stats.get.rowCount.isDefined) {
       statsProperties += STATISTICS_NUM_ROWS -> stats.get.rowCount.get.toString()
     }
+
+    // For datasource tables the data schema is stored in the table properties.
+    val schema = rawTable.properties.get(DATASOURCE_PROVIDER) match {
+      case Some(provider) => getSchemaFromTableProperties(rawTable)
+      case _ => rawTable.schema
--- End diff --

Should we call getTable().schema, or do you think it's too verbose? val schema = getTable(db, table).schema?
[GitHub] spark pull request #18804: [SPARK-21599][SQL] Collecting column statistics f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18804#discussion_r130783264

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -642,8 +642,15 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     if (stats.get.rowCount.isDefined) {
       statsProperties += STATISTICS_NUM_ROWS -> stats.get.rowCount.get.toString()
     }
+
+    // For datasource tables the data schema is stored in the table properties.
+    val schema = rawTable.properties.get(DATASOURCE_PROVIDER) match {
+      case Some(provider) => getSchemaFromTableProperties(rawTable)
+      case _ => rawTable.schema
--- End diff --

Maybe call `restoreTableMetadata` to avoid duplicate logic.
[GitHub] spark pull request #18804: [SPARK-21599][SQL] Collecting column statistics f...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18804#discussion_r130783022

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -642,8 +642,15 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     if (stats.get.rowCount.isDefined) {
       statsProperties += STATISTICS_NUM_ROWS -> stats.get.rowCount.get.toString()
     }
+
+    // For datasource tables the data schema is stored in the table properties.
+    val schema = rawTable.properties.get(DATASOURCE_PROVIDER) match {
+      case Some(provider) => getSchemaFromTableProperties(rawTable)
+      case _ => rawTable.schema
--- End diff --

See the code in https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L755-L758
[GitHub] spark pull request #18804: [SPARK-21599][SQL] Collecting column statistics f...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18804#discussion_r130783003

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -642,8 +642,15 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     if (stats.get.rowCount.isDefined) {
       statsProperties += STATISTICS_NUM_ROWS -> stats.get.rowCount.get.toString()
     }
+
+    // For datasource tables the data schema is stored in the table properties.
+    val schema = rawTable.properties.get(DATASOURCE_PROVIDER) match {
+      case Some(provider) => getSchemaFromTableProperties(rawTable)
+      case _ => rawTable.schema
--- End diff --

For Hive serde tables that were created by Spark 2.1 or later, we should still restore it from table properties.
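Putting the thread together, a hedged sketch of the shape the fix could take at the stats-writing call site shown in the diff (simplified; it leans on the existing restoreTableMetadata helper rather than re-implementing the property parsing):

```scala
// Sketch only: reuse the already-fetched rawTable so no extra metastore
// round-trip is needed, and let restoreTableMetadata decide whether the
// schema lives in table properties (data source / Spark >= 2.1 tables)
// or in the Hive-level schema.
val schema = restoreTableMetadata(rawTable).schema
```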
[GitHub] spark issue #18779: [SPARK-21580][SQL]Integers in aggregation expressions ar...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18779

@10110346 Can't we also do the same on order by ordinal?
[GitHub] spark issue #18804: [SPARK-21599][SQL] Collecting column statistics for data...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18804

LGTM
[GitHub] spark pull request #18804: [SPARK-21599][SQL] Collecting column statistics f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18804#discussion_r130780780

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -117,6 +117,26 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
     }
   }
+
+  test("analyze non hive compatible datasource tables") {
+    val table = "parquet_tab"
+    withTable(table) {
+      sql(
+        s"""
+           |CREATE TABLE $table (a int, b int)
+           |USING parquet
+           |OPTIONS (skipHiveMetadata true)
+         """.stripMargin)
+      sql(s"insert into $table values (1, 1)")
+      sql(s"insert into $table values (2, 1)")
--- End diff --

nit: minor style issue. `INSERT INTO...`
[GitHub] spark issue #18807: [SPARK-21601][BUILD] Modify the pom.xml file, increase t...
Github user highfei2011 commented on the issue: https://github.com/apache/spark/pull/18807

OK, thanks, @markhamstra.
[GitHub] spark pull request #18807: [SPARK-21601][BUILD] Modify the pom.xml file, inc...
Github user highfei2011 closed the pull request at: https://github.com/apache/spark/pull/18807
[GitHub] spark issue #16970: [SPARK-19497][SS]Implement streaming deduplication
Github user KevinZwx commented on the issue: https://github.com/apache/spark/pull/16970

I'm a little confused by the behavior of dropDuplicates with a watermark. According to my understanding of the guide documentation, if I have the following code, I expect it to still deduplicate by uuid but use the timestamp column and the watermark to expire state.

`.withWatermark("timestamp", "1 day") .dropDuplicates("uuid", "timestamp")`

But in fact I found that the program apparently uses uuid and timestamp together as a combined key to deduplicate elements, because the result count is much larger than with dropDuplicates("uuid") and closer to the result with no deduplication. Is this the expected behavior? If so, how can I achieve what I want?
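For what it's worth, a sketch of the semantics in question (as of Spark 2.2; `events` is an assumed streaming DataFrame with uuid and timestamp columns): every column passed to dropDuplicates becomes part of the deduplication key.

```scala
// Key = (uuid, timestamp): the watermark can expire state older than 1 day,
// but two events with the same uuid and different timestamps are both kept.
val dedupByPair = events
  .withWatermark("timestamp", "1 day")
  .dropDuplicates("uuid", "timestamp")

// Key = uuid only: true dedup by uuid, but the key carries no event-time
// column, so the watermark cannot expire state and it grows without bound.
val dedupByUuid = events
  .withWatermark("timestamp", "1 day")
  .dropDuplicates("uuid")
```

So the observed behavior appears to be the expected one; as of Spark 2.2 there is no built-in way to deduplicate on uuid alone while still expiring state by the watermark.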
[GitHub] spark issue #18807: [SPARK-21601][BUILD] Modify the pom.xml file, increase t...
Github user markhamstra commented on the issue: https://github.com/apache/spark/pull/18807

These are maven-compiler-plugin configurations. We don't use maven-compiler-plugin to compile Java code: https://github.com/apache/spark/commit/74cda94c5e496e29f42f1044aab90cab7dbe9d38
[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18742

Merged build finished. Test PASSed.
[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18742

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80146/
[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18742

**[Test build #80146 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80146/testReport)** for PR 18742 at commit [`470dd7c`](https://github.com/apache/spark/commit/470dd7ccdb9ea5185494b21cb8886e3597ad505e).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18779: [SPARK-21580][SQL]Integers in aggregation expressions ar...
Github user 10110346 commented on the issue: https://github.com/apache/spark/pull/18779

@viirya Applying this only to `group-by ordinal` is a good idea, I think, but it would also make the handling of `order-by ordinal` and `group-by ordinal` inconsistent, and I feel it's more complicated than the current changes.
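For readers following along, a quick illustration of the two features being compared (table and column names are made up; `spark` is an assumed SparkSession):

```scala
// An "ordinal" is a select-list position used in place of a column name.
spark.sql("SELECT dept, sum(salary) FROM emp GROUP BY 1")                // group-by ordinal: 1 resolves to dept
spark.sql("SELECT dept, sum(salary) FROM emp GROUP BY dept ORDER BY 2")  // order-by ordinal: 2 resolves to sum(salary)
```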
[GitHub] spark issue #18734: [SPARK-21070][PYSPARK] Attempt to update cloudpickle aga...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/18734

If we can reach agreement on this I'll see about trying to get our local workarounds upstreamed into cloudpickle.
[GitHub] spark issue #18734: [SPARK-21070][PYSPARK] Attempt to update cloudpickle aga...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18734

BTW, I also checked that it passes the tests with Python 3.6 locally.
[GitHub] spark issue #16774: [SPARK-19357][ML] Adding parallel model evaluation in ML...
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/16774

I'm confused by your suggestions here and in #18733. I don't think it's appropriate to just "include" similar work that originated in another PR and then suggest that the other PR be suspended.
[GitHub] spark issue #18804: [SPARK-21599][SQL] Collecting column statistics for data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18804

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80143/
[GitHub] spark issue #18804: [SPARK-21599][SQL] Collecting column statistics for data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18804

Merged build finished. Test PASSed.
[GitHub] spark issue #18804: [SPARK-21599][SQL] Collecting column statistics for data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18804

**[Test build #80143 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80143/testReport)** for PR 18804 at commit [`0afefd5`](https://github.com/apache/spark/commit/0afefd5dde2ddbe03ded3f0e85c21b5bc65040b3).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18742

**[Test build #80146 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80146/testReport)** for PR 18742 at commit [`470dd7c`](https://github.com/apache/spark/commit/470dd7ccdb9ea5185494b21cb8886e3597ad505e).
[GitHub] spark issue #18807: [SPARK-21601][BUILD] Modify the pom.xml file, increase t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18807

Can one of the admins verify this patch?
[GitHub] spark pull request #18807: [SPARK-21601][BUILD] Modify the pom.xml file, inc...
GitHub user highfei2011 opened a pull request: https://github.com/apache/spark/pull/18807

[SPARK-21601][BUILD] Modify the pom.xml file, adding the maven compiler jdk properties

## What changes were proposed in this pull request?

When using maven to compile Spark, I want to add explicit jdk properties. This is user-friendly.

## How was this patch tested?

mvn test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/highfei2011/spark master

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18807.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18807

commit 607eba2dc27768fbb3f604f14412efbf180f9300
Author: jifei_yang02
Date: 2017-08-02T02:59:22Z

    Modify the pom.xml file, increase the maven compiler jdk attribute.

commit 4a22a8c364ffd8c0d10576a564b8ed47af3f60e5
Author: jifei_yang02
Date: 2017-08-02T03:18:35Z

    [SPARK-21601][BUILD] Modify the pom.xml file, increase the maven compiler jdk attribute

    ## What changes were proposed in this pull request?
    Modify the pom.xml file,

    ## How was this patch tested?
    mvn test

    Author: highfei2011
[GitHub] spark pull request #18806: [SPARK-21600] The description of "this requires s...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/18806#discussion_r130777347

--- Diff: docs/configuration.md ---
@@ -1638,7 +1638,7 @@ Apart from these, the following properties are also available, and may be useful
     For more detail, see the description here.
-    This requires spark.shuffle.service.enabled to be set.
+    This requires spark.shuffle.service.enabled to be set true.
--- End diff --

I think there's no ambiguity here. Usually a configuration named "xxx.enabled" can only have two values, "true" or "false", so "to be set" usually means to enable it (to set it to true).
[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18742

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80145/
[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18742

**[Test build #80145 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80145/testReport)** for PR 18742 at commit [`ac4cf70`](https://github.com/apache/spark/commit/ac4cf70d7968848638acc080c96f5397275b4655).

* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18742

Merged build finished. Test FAILed.
[GitHub] spark issue #18803: [SPARK-21597][SS]Fix a potential overflow issue in Event...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18803

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80139/
[GitHub] spark issue #18803: [SPARK-21597][SS]Fix a potential overflow issue in Event...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18803

Merged build finished. Test PASSed.
[GitHub] spark issue #18803: [SPARK-21597][SS]Fix a potential overflow issue in Event...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18803

**[Test build #80139 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80139/testReport)** for PR 18803 at commit [`7252e2a`](https://github.com/apache/spark/commit/7252e2ab214a1834d27506a4c25333197c3dfc01).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class EventTimeStats(var max: Long, var min: Long, var avg: Double, var count: Long)`
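The new class signature flagged above hints at how the overflow is avoided: the stats track a running average as a Double instead of accumulating a Long sum. A hedged sketch of that idea (the add method is an assumption inferred from the signature, not necessarily the merged code):

```scala
case class EventTimeStats(var max: Long, var min: Long, var avg: Double, var count: Long) {
  def add(eventTime: Long): Unit = {
    max = math.max(max, eventTime)
    min = math.min(min, eventTime)
    count += 1
    // Incremental mean: never sums all event times into a Long, so it
    // cannot overflow the way `sum += eventTime` could.
    avg += (eventTime - avg) / count
  }
}
```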
[GitHub] spark pull request #18746: [ML][Python] Implemented UnaryTransformer in Pyth...
Github user ajaysaini725 commented on a diff in the pull request: https://github.com/apache/spark/pull/18746#discussion_r130775557

--- Diff: python/pyspark/ml/base.py ---
@@ -116,3 +121,44 @@ class Model(Transformer):
     """
     __metaclass__ = ABCMeta
+
+
+@inherit_doc
+class UnaryTransformer(HasInputCol, HasOutputCol, Transformer):
+
+    @abstractmethod
+    def createTransformFunc(self):
+        """
+        Creates the transform function using the given param map.
+        """
+        raise NotImplementedError()
+
+    @abstractmethod
+    def outputDataType(self):
+        """
+        Returns the data type of the output column as a sql type
+        """
+        raise NotImplementedError()
+
+    @abstractmethod
+    def validateInputType(self, inputType):
+        """
+        Validates the input type. Throws an exception if it is invalid.
+        """
+        raise NotImplementedError()
+
+    def transformSchema(self, schema):
+        inputType = schema[self.getInputCol()].dataType
+        self.validateInputType(inputType)
+        if self.getOutputCol() in schema.names:
+            raise ValueError("Output column %s already exists." % self.getOutputCol())
+        outputFields = copy.copy(schema.fields)
+        outputFields.append(StructField(self.getOutputCol(),
+                                        self.outputDataType(),
+                                        nullable=False))
+        return StructType(outputFields)
+
+    def transform(self, dataset, paramMap=None):
+        transformSchema(dataset.schema())
--- End diff --

Right, I accidentally overrode transform instead of _transform. Fixed!
[GitHub] spark issue #18742: [Spark-21542][ML][Python]Python persistence helper funct...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18742

**[Test build #80145 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80145/testReport)** for PR 18742 at commit [`ac4cf70`](https://github.com/apache/spark/commit/ac4cf70d7968848638acc080c96f5397275b4655).
[GitHub] spark pull request #18746: [ML][Python] Implemented UnaryTransformer in Pyth...
Github user ajaysaini725 commented on a diff in the pull request: https://github.com/apache/spark/pull/18746#discussion_r130775517

--- Diff: python/pyspark/ml/base.py ---
@@ -116,3 +121,44 @@ class Model(Transformer):
     """
     __metaclass__ = ABCMeta
+
+
+@inherit_doc
+class UnaryTransformer(HasInputCol, HasOutputCol, Transformer):
+
+    @abstractmethod
+    def createTransformFunc(self):
+        """
+        Creates the transform function using the given param map.
+        """
+        raise NotImplementedError()
+
+    @abstractmethod
+    def outputDataType(self):
+        """
+        Returns the data type of the output column as a sql type
+        """
+        raise NotImplementedError()
+
+    @abstractmethod
+    def validateInputType(self, inputType):
+        """
+        Validates the input type. Throws an exception if it is invalid.
+        """
+        raise NotImplementedError()
+
+    def transformSchema(self, schema):
+        inputType = schema[self.getInputCol()].dataType
+        self.validateInputType(inputType)
+        if self.getOutputCol() in schema.names:
+            raise ValueError("Output column %s already exists." % self.getOutputCol())
+        outputFields = copy.copy(schema.fields)
+        outputFields.append(StructField(self.getOutputCol(),
+                                        self.outputDataType(),
+                                        nullable=False))
+        return StructType(outputFields)
+
+    def transform(self, dataset, paramMap=None):
+        transformSchema(dataset.schema())
+        transformUDF = udf(self.createTransformFunc(), self.outputDataType())
+        dataset.withColumn(self.getOutputCol(), transformUDF(self.getInputCol()))
--- End diff --

self.createTransformFunc returns a function, which is passed to the udf, so in this case I think it is okay.
[GitHub] spark issue #18799: [SPARK-21596][SS]Ensure places calling HDFSMetadataLog.g...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18799 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18799: [SPARK-21596][SS]Ensure places calling HDFSMetadataLog.g...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18799 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80138/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18799: [SPARK-21596][SS]Ensure places calling HDFSMetadataLog.g...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18799 **[Test build #80138 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80138/testReport)** for PR 18799 at commit [`91efeb3`](https://github.com/apache/spark/commit/91efeb3553e5e9f0f6fb45ae7574231c2e70d845). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18734: [SPARK-21070][PYSPARK] Attempt to update cloudpic...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/18734#discussion_r130774005 --- Diff: python/pyspark/cloudpickle.py ---
@@ -220,12 +322,7 @@ def save_function(self, obj, name=None):
         if name is None:
             name = obj.__name__
-        try:
-            # whichmodule() could fail, see
-            # https://bitbucket.org/gutworth/six/issues/63/importing-six-breaks-pickling
-            modname = pickle.whichmodule(obj, name)
-        except Exception:
-            modname = None
+        modname = pickle.whichmodule(obj, name)
--- End diff --
It's a good question: the underlying issue was marked resolved in 2014, and from looking at the commit it seems like it should actually be resolved. That being said, it's true that some people might be on a system with an old installed version of six, so perhaps we should keep this workaround.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
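If the workaround is kept, it amounts to the defensive pattern being removed above; a minimal sketch, with the helper name _safe_whichmodule invented for illustration:

```python
import pickle

def _safe_whichmodule(obj, name):
    # whichmodule() can fail when an old version of six is installed, see
    # https://bitbucket.org/gutworth/six/issues/63/importing-six-breaks-pickling
    try:
        return pickle.whichmodule(obj, name)
    except Exception:
        return None
```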
[GitHub] spark pull request #18734: [SPARK-21070][PYSPARK] Attempt to update cloudpic...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/18734#discussion_r130773575 --- Diff: python/pyspark/cloudpickle.py ---
@@ -397,42 +625,7 @@ def save_global(self, obj, name=None, pack=struct.pack):
         typ = type(obj)
         if typ is not obj and isinstance(obj, (type, types.ClassType)):
-            d = dict(obj.__dict__)  # copy dict proxy to a dict
-            if not isinstance(d.get('__dict__', None), property):
-                # don't extract dict that are properties
-                d.pop('__dict__', None)
-            d.pop('__weakref__', None)
-
-            # hack as __new__ is stored differently in the __dict__
-            new_override = d.get('__new__', None)
-            if new_override:
-                d['__new__'] = obj.__new__
-
-            # workaround for namedtuple (hijacked by PySpark)
-            if getattr(obj, '_is_namedtuple_', False):
-                self.save_reduce(_load_namedtuple, (obj.__name__, obj._fields))
-                return
-
-            self.save(_load_class)
-            self.save_reduce(typ, (obj.__name__, obj.__bases__, {"__doc__": obj.__doc__}), obj=obj)
-            d.pop('__doc__', None)
-            # handle property and staticmethod
-            dd = {}
-            for k, v in d.items():
--- End diff --
Let's double-check this part with @davies.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
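The diff is truncated at the start of the member loop; roughly, that loop handled class members that do not pickle directly. An illustrative sketch of the general pattern, not the exact removed code:

```python
def _reduce_members(d):
    # d is the class __dict__ copied into a plain dict, as in the diff above.
    dd = {}
    for k, v in d.items():
        if isinstance(v, property):
            # A property is rebuilt at load time from its accessor functions.
            dd[k] = (property, (v.fget, v.fset, v.fdel, v.__doc__))
        elif isinstance(v, staticmethod) and hasattr(v, '__func__'):
            # A staticmethod is rebuilt from the function it wraps.
            dd[k] = (staticmethod, (v.__func__,))
        else:
            dd[k] = v
    return dd
```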
[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/18805 Re build failure: you can repro that locally by running "./dev/test-dependencies.sh". It's failing because a new dependency was introduced... you need to add it to `dev/deps/spark-deps-hadoop-XXX` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18806: [SPARK-21600] The description of "this requires spark.sh...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18806 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/18805#discussion_r130769482 --- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -50,13 +51,14 @@ private[spark] object CompressionCodec {

   private[spark] def supportsConcatenationOfSerializedStreams(codec: CompressionCodec): Boolean = {
     (codec.isInstanceOf[SnappyCompressionCodec] || codec.isInstanceOf[LZFCompressionCodec]
-      || codec.isInstanceOf[LZ4CompressionCodec])
+      || codec.isInstanceOf[LZ4CompressionCodec] || codec.isInstanceOf[ZStandardCompressionCodec])
   }

   private val shortCompressionCodecNames = Map(
     "lz4" -> classOf[LZ4CompressionCodec].getName,
     "lzf" -> classOf[LZFCompressionCodec].getName,
-    "snappy" -> classOf[SnappyCompressionCodec].getName)
+    "snappy" -> classOf[SnappyCompressionCodec].getName,
+    "zstd" -> classOf[SnappyCompressionCodec].getName)
--- End diff --
you mean `ZStandardCompressionCodec`?
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
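The short-name map matters because `spark.io.compression.codec` accepts these aliases, so with the copy-paste slip "zstd" would silently resolve to the Snappy codec. A hypothetical PySpark usage, assuming the PR's zstd support lands with the fixed mapping:

```python
from pyspark import SparkConf, SparkContext

# "zstd" is a short codec name; a fully qualified codec class name is
# also accepted by spark.io.compression.codec.
conf = SparkConf().set("spark.io.compression.codec", "zstd")
sc = SparkContext(conf=conf)
```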
[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/18805 In the `Benchmark` section, the values for `Lz4` are all zeros, which is confusing to read... at first I thought they were absolute values, but they are supposed to be relative. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org