[GitHub] spark issue #18315: [SPARK-21108] [ML] [WIP] convert LinearSVC to aggregator...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18315 **[Test build #80661 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80661/testReport)** for PR 18315 at commit [`94e0250`](https://github.com/apache/spark/commit/94e025055a7755460cb83afe375d11a99dda8c0c).
[GitHub] spark issue #18934: [SPARK-21721][SQL] Clear FileSystem deleteOnExit cache w...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18934 cc @viirya
[GitHub] spark pull request #18944: [SPARK-21732][SQL]Lazily init hive metastore clie...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18944
[GitHub] spark issue #18944: [SPARK-21732][SQL]Lazily init hive metastore client
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18944 Thanks! Merging to master.
[GitHub] spark pull request #18939: [SPARK-21724][SQL][DOC] Adds since information in...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18939
[GitHub] spark issue #18939: [SPARK-21724][SQL][DOC] Adds since information in the do...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18939 Thanks! Merging to master.
[GitHub] spark issue #18939: [SPARK-21724][SQL][DOC] Adds since information in the do...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18939 LGTM
[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80660 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80660/testReport)** for PR 18902 at commit [`fd1eb43`](https://github.com/apache/spark/commit/fd1eb43d26bb08806ec2deefd86014caf7dcefdd).
[GitHub] spark issue #18944: [SPARK-21732][SQL]Lazily init hive metastore client
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18944 Merged build finished. Test PASSed.
[GitHub] spark issue #18944: [SPARK-21732][SQL]Lazily init hive metastore client
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18944 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80655/
[GitHub] spark issue #18944: [SPARK-21732][SQL]Lazily init hive metastore client
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18944 **[Test build #80655 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80655/testReport)** for PR 18944 at commit [`9eb9149`](https://github.com/apache/spark/commit/9eb91493aef7d4ce0e0f4cccd5597a4815d6fd58).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18926 Even if we plan to drop `long` in this PR, [the checking](https://github.com/nchammas/spark/blob/fc1d84f002f5bd66bcad038a5581a05ade8dbc35/python/pyspark/sql/column.py#L408) looks weird to me. Basically, the change just wants to ensure that the type of `length` is `int`. Since this PR is pretty small, we should fix the issue here instead of opening another one.
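To make the proposal concrete, here is a minimal sketch of the kind of unified check being discussed; the helper name and message wording are hypothetical (not the merged fix), and it assumes Python 2's `long` should be accepted and normalized to `int`, as suggested further down this thread.

```python
import sys

if sys.version_info[0] >= 3:
    long = int  # Python 3 has no separate long type

# Hypothetical helper, not the actual patch: validate both arguments in
# one place, accept int and long, and normalize long values to int.
def _check_substr_args(startPos, length):
    for name, value in (("startPos", startPos), ("length", length)):
        if not isinstance(value, (int, long)):
            raise TypeError(
                "%s must be an int, got %s" % (name, type(value).__name__))
    return int(startPos), int(length)
```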
[GitHub] spark issue #18934: [SPARK-21721][SQL] Clear FileSystem deleteOnExit cache w...
Github user yzheng616 commented on the issue: https://github.com/apache/spark/pull/18934 Please try to fix it in 2.1 too. We have a product running on this version of Spark. Thanks a lot!
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18926 For ^, I want to make this separate if possible. Do you guys feel strongly about supporting `long` (and, in particular, "mixed" types) here - @gatorsmile and @ueshin?
[GitHub] spark issue #18931: [SPARK-21717][SQL][WIP] Decouple consume functions of ph...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18931 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80654/
[GitHub] spark issue #18931: [SPARK-21717][SQL][WIP] Decouple consume functions of ph...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18931 Merged build finished. Test PASSed.
[GitHub] spark issue #18931: [SPARK-21717][SQL][WIP] Decouple consume functions of ph...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18931 **[Test build #80654 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80654/testReport)** for PR 18931 at commit [`c04da15`](https://github.com/apache/spark/commit/c04da158a4221104717944f72be2c67ae23fdb63).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18946: [SPARK-19471][SQL]AggregationIterator does not initializ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18946 **[Test build #80659 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80659/testReport)** for PR 18946 at commit [`9f2ec8f`](https://github.com/apache/spark/commit/9f2ec8f2c2465d12d06c944b814f5363a99a0271).
[GitHub] spark pull request #18926: [SPARK-21712] [PySpark] Clarify type error for Co...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18926#discussion_r133122900

--- Diff: python/pyspark/sql/column.py ---
```
@@ -406,7 +406,13 @@ def substr(self, startPos, length):
         [Row(col=u'Ali'), Row(col=u'Bob')]
         """
         if type(startPos) != type(length):
-            raise TypeError("Can not mix the type")
+            raise TypeError(
+                "startPos and length must be the same type. "
+                "Got {startPos_t} and {length_t}, respectively."
```
--- End diff --

Yea, I think we could support `long`. This PR basically targets an exception-message fix; could we make that change separate? I guess supporting the case above requires a set of regression tests with min/max of int, a documentation fix, etc., which I think is rather loosely related to the JIRA.
[GitHub] spark issue #18942: [BACKPORT-2.1][SPARK-19372][SQL] Fix throwing a Java exc...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18942 This might be too risky to be merged to 2.1.1.
[GitHub] spark issue #18938: [SPARK-21363][SQL] Prevent name duplication in (global/l...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/18938 ok, I'll close this for now. If you get time, could you set 3.0 as the target version on the JIRA? Thx!
[GitHub] spark pull request #18938: [SPARK-21363][SQL] Prevent name duplication in (g...
Github user maropu closed the pull request at: https://github.com/apache/spark/pull/18938
[GitHub] spark issue #18938: [SPARK-21363][SQL] Prevent name duplication in (global/l...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18938 Not sure. You can bring it up when we decide to jump to 3.x. Thanks!
[GitHub] spark issue #18853: [SPARK-21646][SQL] BinaryComparison shouldn't auto cast ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18853 Currently, type casting has a few issues when the types differ, and so far we do not have any good option that resolves all of them. Thus, we are hesitant to introduce any behavior change unless it is well defined. Could you do some research on how other systems behave? Is there any common rule?
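As one data point for that research, here is a hedged PySpark illustration of the surprise under discussion (the data and column name are made up): comparing a string column with an integer literal inserts an implicit cast, and rows whose strings do not parse as numbers can silently drop out instead of raising an error.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: one numeric-looking string and one that is not.
df = spark.createDataFrame([("10",), ("abc",)], ["s"])

# The comparison implicitly casts the operands to a common type; "abc"
# fails the cast, becomes NULL, and is filtered out without any error.
df.filter(df["s"] == 10).show()  # only the "10" row survives
```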
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r133121659

--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala (excerpt) ---
```
package org.apache.spark.ml.stat

/**
 * A builder object that provides summary statistics about a given column.
 *
 * Users should not directly create such builders, but instead use one of the methods in
 * [[Summarizer]].
 */
@Since("2.3.0")
abstract class SummaryBuilder {
  /**
   * Returns an aggregate object that contains the summary of the column with the requested
   * metrics.
   * @param featuresCol a column that contains a features Vector object.
   * @param weightCol a column that contains a weight value.
   * @return an aggregate column that contains the statistics. The exact content of this
   *         structure is determined during the creation of the builder.
   */
  @Since("2.3.0")
  def summary(featuresCol: Column, weightCol: Column): Column

  @Since("2.3.0")
  def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0))
}

/**
 * Tools for vectorized statistics on MLlib Vectors.
 *
 * This class lets users pick the statistics they would like to extract for a given column.
 * Here is an example in Scala:
 * {{{
 *   val dataframe = ... // Some dataframe containing a feature column
 *   val allStats = dataframe.select(Summarizer.metrics("min", "max").summary($"features"))
 *   val Row(min_, max_) = allStats.first()
 * }}}
 *
 * If one wants to get a single metric, shortcuts are also available:
 * {{{
 *   val meanDF = dataframe.select(Summarizer.mean($"features"))
 *   val Row(mean_) = meanDF.first()
 * }}}
 *
 * Note: Currently, the performance of this interface is about 2x~3x slower than using the RDD
 * interface.
 */
@Since("2.3.0")
object Summarizer extends Logging {

  /**
   * Given a list of metrics, provides a builder that in turn computes metrics from a column.
   *
   * See the documentation of [[Summarizer]] for an example.
   *
   * The following metrics are accepted (case sensitive):
   *  - mean: a vector that contains the coefficient-wise mean.
   *  - variance: a vector that contains the coefficient-wise variance.
   *  - count: the count of all vectors seen.
   *  - numNonzeros: a vector with the number of non-zeros for each coefficient.
   *  - max: the maximum for each coefficient.
   *  - min: the minimum for each coefficient.
   *  - normL2: the Euclidean norm for each coefficient.
   *  - normL1: the L1 norm of each coefficient (sum of the absolute values).
   * @param firstMetric the metric being provided.
   * @param metrics additional metrics that can be provided.
   * @return a builder.
   * @throws IllegalArgumentException if one of the metric names is not understood.
   */
  @Since("2.3.0")
  def metrics(firstMetric: String, metrics: String*): SummaryBuilder
```
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18926 It seems the comment is hidden. Could you address the comment https://github.com/apache/spark/pull/18926#discussion_r133121408?
[GitHub] spark issue #18938: [SPARK-21363][SQL] Prevent name duplication in (global/l...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/18938 Will we still support this in the 3.x releases? If so, I'll close the JIRA as "Won't Fix".
[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/18902 @hhbyyh Good idea! We can also use this trick to compute the median, because the method [`multipleApproxQuantiles`](https://github.com/apache/spark/blob/0e80ecae300f3e2033419b2d98da8bf092c105bb/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala#L65) can handle both `null` and `NaN`.
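For reference, a minimal PySpark sketch of that trick through the public `approxQuantile` API (which is backed by `multipleApproxQuantiles`); the column names and tolerance are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, 10.0), (2.0, float("nan")), (None, 30.0)],
    ["height", "weight"])

# One pass over the data computes approximate medians for all requested
# columns; null/NaN entries are skipped, per the comment above.
medians = df.approxQuantile(["height", "weight"], [0.5], 0.001)
# medians[i][0] is the approximate median of the i-th requested column.
```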
[GitHub] spark pull request #18926: [SPARK-21712] [PySpark] Clarify type error for Co...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/18926#discussion_r133121408

--- Diff: python/pyspark/sql/column.py ---
--- End diff --

I'm sorry for the delay. I guess we can support `long` by casting it to `int`, and also the "mixed" cases @gatorsmile mentioned. What do you think @HyukjinKwon?
[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18798 @yanboliang I will update ASAP, thanks!
[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/18798 @WeichenXu123 I left some minor comments, otherwise LGTM. Thanks.
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r133120700 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r133120739

--- Diff: mllib/src/test/scala/org/apache/spark/ml/stat/SummarizerSuite.scala (excerpt) ---
```
package org.apache.spark.ml.stat

class SummarizerSuite extends SparkFunSuite with MLlibTestSparkContext {

  import testImplicits._
  import Summarizer._
  import SummaryBuilderImpl._

  private case class ExpectedMetrics(
      mean: Seq[Double],
      variance: Seq[Double],
      count: Long,
      numNonZeros: Seq[Long],
      max: Seq[Double],
      min: Seq[Double],
      normL2: Seq[Double],
      normL1: Seq[Double])

  // The input is expected to be either a sparse vector, a dense vector, or an array of
  // doubles (which will be converted to a dense vector).
  // The expected value is the list of all the known metrics.
  //
  // The tests take a list of input vectors and a list of all the summary values that
  // are expected for this input. They currently test against some fixed subset of the
  // metrics, but should be made fuzzy in the future.
  private def testExample(name: String, input: Seq[Any], exp: ExpectedMetrics): Unit = {
    def inputVec: Seq[Vector] = input.map {
      case x: Array[Double @unchecked] => Vectors.dense(x)
      case x: Seq[Double @unchecked] => Vectors.dense(x.toArray)
      case x: Vector => x
      case x => throw new Exception(x.toString)
    }

    val s = {
      val s2 = new MultivariateOnlineSummarizer
      inputVec.foreach(v => s2.add(OldVectors.fromML(v)))
      s2
    }

    // Because the Spark context is reset between tests, we cannot hold a reference onto it.
    def wrapped() = {
      val df = sc.parallelize(inputVec).map(Tuple1.apply).toDF("features")
      val c = df.col("features")
      (df, c)
    }

    registerTest(s"$name - mean only") {
      val (df, c) = wrapped()
      compare(df.select(metrics("mean").summary(c), mean(c)), Seq(Row(exp.mean), s.mean))
    }
```
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r133109722 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r133115762 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r133115706 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r133120087 --- Diff: mllib/src/test/scala/org/apache/spark/ml/stat/SummarizerSuite.scala ---
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r133111870
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -0,0 +1,593 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.stat
+
+import java.io._
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
+import org.apache.spark.sql.Column
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Expression, UnsafeArrayData}
+import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Complete, TypedImperativeAggregate}
+import org.apache.spark.sql.catalyst.util.ArrayData
+import org.apache.spark.sql.functions.lit
+import org.apache.spark.sql.types._
+
+/**
+ * A builder object that provides summary statistics about a given column.
+ *
+ * Users should not directly create such builders, but instead use one of the methods in
+ * [[Summarizer]].
+ */
+@Since("2.3.0")
+abstract class SummaryBuilder {
+  /**
+   * Returns an aggregate object that contains the summary of the column with the requested metrics.
+   * @param featuresCol a column that contains a features Vector object.
+   * @param weightCol a column that contains a weight value.
+   * @return an aggregate column that contains the statistics. The exact content of this
+   *         structure is determined during the creation of the builder.
+   */
+  @Since("2.3.0")
+  def summary(featuresCol: Column, weightCol: Column): Column
+
+  @Since("2.3.0")
+  def summary(featuresCol: Column): Column = summary(featuresCol, lit(1.0))
+}
+
+/**
+ * Tools for vectorized statistics on MLlib Vectors.
+ *
+ * The methods in this package provide various statistics for Vectors contained inside DataFrames.
+ *
+ * This class lets users pick the statistics they would like to extract for a given column. Here is
+ * an example in Scala:
+ * {{{
+ *   val dataframe = ... // Some dataframe containing a feature column
+ *   val allStats = dataframe.select(Summarizer.metrics("min", "max").summary($"features"))
+ *   val Row(min_, max_) = allStats.first()
+ * }}}
+ *
+ * If one wants to get a single metric, shortcuts are also available:
+ * {{{
+ *   val meanDF = dataframe.select(Summarizer.mean($"features"))
+ *   val Row(mean_) = meanDF.first()
+ * }}}
+ *
+ * Note: Currently, the performance of this interface is about 2x~3x slower than using the RDD
+ * interface.
+ */
+@Since("2.3.0")
+object Summarizer extends Logging {
+
+  import SummaryBuilderImpl._
+
+  /**
+   * Given a list of metrics, provides a builder that in turn computes metrics from a column.
+   *
+   * See the documentation of [[Summarizer]] for an example.
+   *
+   * The following metrics are accepted (case sensitive):
+   *  - mean: a vector that contains the coefficient-wise mean.
+   *  - variance: a vector that contains the coefficient-wise variance.
+   *  - count: the count of all vectors seen.
+   *  - numNonzeros: a vector with the number of non-zeros for each coefficient.
+   *  - max: the maximum for each coefficient.
+   *  - min: the minimum for each coefficient.
+   *  - normL2: the Euclidean norm for each coefficient.
+   *  - normL1: the L1 norm of each coefficient (sum of the absolute values).
+   * @param firstMetric the metric being provided
+   * @param metrics additional metrics that can be provided.
+   * @return a builder.
+   * @throws IllegalArgumentException if one of the metric names is not understood.
+   *
+   * Note: Currently, the performance of this interface is about 2x~3x slower than using the RDD
+   * interface.
+   */
+  @Since("2.3.0")
+  def
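A compact usage sketch of the builder API described in the quoted doc comment. The metric names come from the diff; the SparkSession `spark` and a DataFrame `df` with a Vector column "features" and a Double column "weight" are assumed for illustration, and this is a sketch of the WIP API rather than the final one:
```scala
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.Row
import spark.implicits._  // for the $"..." column syntax

// One aggregate computes all requested metrics over the Vector column in a
// single pass; the result is a struct with one field per requested metric.
val stats = df.select(
  Summarizer.metrics("mean", "variance", "count").summary($"features", $"weight"))
val summary = stats.first().getAs[Row](0)
```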
[GitHub] spark issue #18938: [SPARK-21363][SQL] Prevent name duplication in (global/l...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18938 Merged build finished. Test FAILed.
[GitHub] spark issue #18938: [SPARK-21363][SQL] Prevent name duplication in (global/l...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18938 **[Test build #80658 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80658/testReport)** for PR 18938 at commit [`cc220e9`](https://github.com/apache/spark/commit/cc220e91014c2ab5b5be44de85ef4fef67c6b243).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18938: [SPARK-21363][SQL] Prevent name duplication in (global/l...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18938 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80658/ Test FAILed.
[GitHub] spark issue #18938: [SPARK-21363][SQL] Prevent name duplication in (global/l...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/18938 IMHO, other database(-like) systems prevent this kind of name duplication in temporary tables/views, so it seems we should do the same:
```
// postgresql case
postgres=# create table s0 (a int, b int);
CREATE TABLE
postgres=# create table s1 (a int, b int);
CREATE TABLE
postgres=# select * from s0, s1 where s0.a = s1.a;
 a | b | a | b
---+---+---+---
(0 rows)
postgres=# create view s2 as select * from s0, s1 where s0.a = s1.a;
ERROR: column "a" specified more than once
postgres=# create temporary view s2 as select * from s0, s1 where s0.a = s1.a;
ERROR: column "a" specified more than once

// mysql case
mysql> create table s0 (a int, b int);
Query OK, 0 rows affected (0.02 sec)
mysql> create table s1 (a int, b int);
Query OK, 0 rows affected (0.02 sec)
mysql> select * from s0, s1 where s0.a = s1.a;
Empty set (0.00 sec)
mysql> create view s2 as select * from s0, s1 where s0.a = s1.a;
ERROR 1060 (42S21): Duplicate column name 'a'
mysql> create temporary table s2 as select * from s0, s1 where s0.a = s1.a;
ERROR 1060 (42S21): Duplicate column name 'a'

// hive case
hive> create table s0 (a int, b int);
OK
hive> create table s1 (a int, b int);
OK
hive> select * from s0, s1 where s0.a = s1.a;
OK
hive> create temporary view s2 as select * from s0, s1 where s0.a = s1.a;
FAILED: ParseException line 1:17 cannot recognize input near 'create' 'temporary' 'view' in ddl statement
hive> create view s2 as select * from s0, s1 where s0.a = s1.a;
FAILED: SemanticException org.apache.hadoop.hive.ql.optimizer.calcite.CalciteViewSemanticException: Duplicate column name: a
hive> create temporary table s2 as select * from s0, s1 where s0.a = s1.a;
FAILED: SemanticException [Error 10036]: Duplicate column name: a
```
But I think this name duplication is expected behavior in the 2.x releases (for example, we already have some tests related to this: https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/ErrorPositionSuite.scala#L37). So this might be something to fix in the 3.x releases, maybe.
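For reference, the duplicate-name check that the databases above perform can be expressed in a few lines. A minimal sketch in Scala, assuming plain column-name strings and a case-sensitivity flag (illustrative only, not Spark's actual analyzer code):
```scala
// Reject a schema whose column names collide, optionally ignoring case.
// Hypothetical helper; Spark's real check would live in the analyzer.
def assertNoDuplicateColumns(names: Seq[String], caseSensitive: Boolean): Unit = {
  val normalized = if (caseSensitive) names else names.map(_.toLowerCase)
  val dups = normalized.groupBy(identity).collect { case (n, vs) if vs.size > 1 => n }
  require(dups.isEmpty, s"Found duplicate column(s) in view definition: ${dups.mkString(", ")}")
}
```
With a check like this wired into view creation, `create temporary view s2 as select * from s0, s1 ...` would fail at analysis time, matching the PostgreSQL and MySQL behavior shown above.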
[GitHub] spark pull request #18093: [WIP][SPARK-20774][SQL] Cancel all jobs when Quer...
Github user liyichao closed the pull request at: https://github.com/apache/spark/pull/18093
[GitHub] spark issue #18938: [SPARK-21363][SQL] Prevent name duplication in (global/l...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18938 **[Test build #80658 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80658/testReport)** for PR 18938 at commit [`cc220e9`](https://github.com/apache/spark/commit/cc220e91014c2ab5b5be44de85ef4fef67c6b243).
[GitHub] spark issue #18093: [WIP][SPARK-20774][SQL] Cancel all jobs when QueryExecti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18093 Can one of the admins verify this patch?
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r133119397 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- [same Summarizer.scala diff context as quoted above; the review comment is truncated in the archive]
[GitHub] spark issue #18934: [SPARK-21721][SQL] Clear FileSystem deleteOnExit cache w...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18934 Tried to merge to Spark 2.1, but I hit a conflict. Thus, please let me know if anybody wants it in 2.1. Thanks!
[GitHub] spark pull request #18934: [SPARK-21721][SQL] Clear FileSystem deleteOnExit ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18934
[GitHub] spark issue #18934: [SPARK-21721][SQL] Clear FileSystem deleteOnExit cache w...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18934 Thanks! Merging to master/2.2
[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/18902 Eh, I meant that it may be possible to get the mean values purely with the DataFrame API (converting missingValue/NaN to null) in one pass, so we may need to check the performance comparison. But I guess it looks a little hacky. For the median value it may be harder, so we can use the RDD API (to be confirmed).
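The one-pass DataFrame approach alluded to above can be sketched as follows. This is only an illustration of the idea under stated assumptions (numeric input columns, a `missingValue` sentinel, at least one non-missing value per column), not the Imputer's actual implementation:
```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, isnan, when}

// Compute the mean of every column in a single job: rewrite the missing-value
// sentinel (and NaN) to null so that avg() skips it, then aggregate all
// columns at once.
def columnMeansOnePass(df: DataFrame, cols: Seq[String], missingValue: Double): Map[String, Double] = {
  val aggs = cols.map { c =>
    avg(when(col(c) === missingValue || isnan(col(c)), null).otherwise(col(c))).as(c)
  }
  val row = df.select(aggs: _*).head()
  cols.zipWithIndex.map { case (c, i) => c -> row.getDouble(i) }.toMap
}
```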
[GitHub] spark issue #18934: [SPARK-21721][SQL] Clear FileSystem deleteOnExit cache w...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18934 LGTM
[GitHub] spark issue #18946: [SPARK-19471][SQL]AggregationIterator does not initializ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18946 LGTM except two minor comments.
[GitHub] spark pull request #18946: [SPARK-19471][SQL]AggregationIterator does not in...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18946#discussion_r133118690
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -573,4 +578,47 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
       spark.sql("SELECT 3 AS c, 4 AS d, SUM(b) FROM testData2 GROUP BY c, d"),
       Seq(Row(3, 4, 9)))
   }
+
+  private def assertNoExceptions(c: Column): Unit = {
--- End diff --
Could you move it to line 566, just before `test("SPARK-21580 ints in aggregation expressions are taken as group-by ordinal.")`? Thanks!
[GitHub] spark issue #18946: [SPARK-19471][SQL]AggregationIterator does not initializ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18946 **[Test build #80657 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80657/testReport)** for PR 18946 at commit [`f057ff8`](https://github.com/apache/spark/commit/f057ff8400076fce615fe9b6521ed1b3d66cb669).
[GitHub] spark issue #18946: [SPARK-19471][SQL]AggregationIterator does not initializ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18946 ok to test
[GitHub] spark pull request #18946: [SPARK-19471][SQL]AggregationIterator does not in...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18946#discussion_r133118490
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -17,13 +17,18 @@
 package org.apache.spark.sql
 
+import scala.util.Random
+
+import org.apache.spark.sql.execution.WholeStageCodegenExec
+import org.apache.spark.sql.execution.aggregate.{HashAggregateExec, ObjectHashAggregateExec, SortAggregateExec}
 import org.apache.spark.sql.expressions.Window
 import org.apache.spark.sql.functions._
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.test.SharedSQLContext
 import org.apache.spark.sql.test.SQLTestData.DecimalData
 import org.apache.spark.sql.types.{Decimal, DecimalType}
+
--- End diff --
Please remove this line. Thanks!
[GitHub] spark issue #18922: [SPARK-21701][CORE] Enable RPC client to use SO_RCVBUF, ...
Github user neoremind commented on the issue: https://github.com/apache/spark/pull/18922 @zsxwing Would you mind verifying the patch for me? I notice you have contributed to the RPC module in Spark. Many thanks!
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18926 LGTM except for the comment above.
[GitHub] spark pull request #18926: [SPARK-21712] [PySpark] Clarify type error for Co...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18926#discussion_r133116060
--- Diff: python/pyspark/sql/tests.py ---
@@ -1220,6 +1220,13 @@ def test_rand_functions(self):
         rndn2 = df.select('key', functions.randn(0)).collect()
         self.assertEqual(sorted(rndn1), sorted(rndn2))
 
+    def test_string_functions(self):
+        from pyspark.sql.functions import col, lit
+        df = self.spark.createDataFrame([['nick']], schema=['name'])
+        self.assertRaises(TypeError, lambda: df.select(col('name').substr(0, lit(1
--- End diff --
How about something like this below, as this PR targets the exception message?
```python
startPos = 0
length = lit(1)
self.assertRaisesRegexp(
    TypeError,
    "must be the same type.*%s.*%s.*" % (type(startPos), type(length)),
    lambda: df.select(col('name').substr(startPos, length)))
```
[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18930#discussion_r133117021
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -359,14 +359,14 @@ case class JsonTuple(children: Seq[Expression])
   @transient private lazy val jsonExpr: Expression = children.head
 
   // the fields to query are the remaining children
-  @transient private lazy val fieldExpressions: Seq[Expression] = children.tail
+  @transient private lazy val fieldExpressions: Array[Expression] = children.tail.toArray
 
   // eagerly evaluate any foldable field names
-  @transient private lazy val foldableFieldNames: IndexedSeq[String] = {
+  @transient private lazy val foldableFieldNames: Array[Option[String]] = {
--- End diff --
I think we should continue to use `IndexedSeq`, which is more efficient, as `foldableFieldNames` will be used many times.
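For context, the pattern being reviewed precomputes constant (foldable) field names once so they are not re-evaluated per row. A minimal self-contained sketch under simplified assumptions, with a `foldable` flag and an `eval` thunk standing in for Catalyst's `Expression` API:
```scala
// Stand-in for a Catalyst expression: `foldable` means it is a constant that
// can be evaluated once up front; `eval` produces its value.
case class FieldExpr(foldable: Boolean, eval: () => Any)

// Evaluate foldable field names eagerly; non-foldable ones stay None and are
// evaluated per row. An IndexedSeq keeps positional lookups cheap in the
// per-row loop, which is the efficiency point raised in the comment above.
def precomputeFieldNames(fields: Seq[FieldExpr]): IndexedSeq[Option[String]] =
  fields.map { f =>
    if (f.foldable) Option(f.eval()).map(_.toString) else None
  }.toIndexedSeq
```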
[GitHub] spark issue #18818: [SPARK-21110][SQL] Structs, arrays, and other orderable ...
Github user aray commented on the issue: https://github.com/apache/spark/pull/18818 @viirya @gatorsmile I have addressed your comments; could you take another look?
[GitHub] spark pull request #18818: [SPARK-21110][SQL] Structs, arrays, and other ord...
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/18818#discussion_r133116720
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala ---
@@ -465,7 +475,7 @@ abstract class BinaryComparison extends BinaryOperator with Predicate {
     }
   }
 
-  protected lazy val ordering = TypeUtils.getInterpretedOrdering(left.dataType)
+  protected lazy val ordering: Ordering[Any] = TypeUtils.getInterpretedOrdering(left.dataType)
--- End diff --
Addressed in cc2f3ec.
[GitHub] spark issue #18930: [SPARK-21677][SQL] json_tuple throws NullPointException ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18930 **[Test build #80656 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80656/testReport)** for PR 18930 at commit [`0078445`](https://github.com/apache/spark/commit/0078445575064a3eec455bcec62fbed0c6dd2001).
[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...
Github user jmchung commented on a diff in the pull request: https://github.com/apache/spark/pull/18930#discussion_r133116488
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2034,4 +2034,13 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
     }
   }
 }
+
+  test("SPARK-21677: json_tuple throws NullPointException when column is null as string type") {
+    checkAnswer(sql(
+      """
+        |SELECT json_tuple('{"a" : 1, "b" : 2}'
+        |, cast(NULL AS STRING), 'b'
+        |, cast(NULL AS STRING), 'a')
+      """.stripMargin), Row(null, "2", null, "1"))
--- End diff --
@viirya Done, the added test case contains a column name, a constant field name, and a null field name.
[GitHub] spark issue #18943: [SPARK-21731][build] Upgrade scalastyle to 0.9.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18943 Merged build finished. Test PASSed.
[GitHub] spark issue #18943: [SPARK-21731][build] Upgrade scalastyle to 0.9.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18943 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80652/ Test PASSed.
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r133116269 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- [same Summarizer.scala diff context as quoted above; the review comment is truncated in the archive]
[GitHub] spark issue #18943: [SPARK-21731][build] Upgrade scalastyle to 0.9.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18943 **[Test build #80652 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80652/testReport)** for PR 18943 at commit [`b1e49fa`](https://github.com/apache/spark/commit/b1e49fafe2e44a489f3c29c09ec7a7866120adda).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r133115997 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- [same Summarizer.scala diff context as quoted above; the review comment is truncated in the archive]
[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18798 @WeichenXu123 Thanks! Looks good.
[GitHub] spark pull request #18798: [SPARK-19634][ML] Multivariate summarizer - dataf...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18798#discussion_r133114092 --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala --- [same Summarizer.scala diff context as quoted above; the review comment is truncated in the archive]
[GitHub] spark issue #18936: [SPARK-21688][ML][MLLIB] make native BLAS the first choi...
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/18936 Yes, they are not the only place, but we only tested on the dense dataset and got the performance data shown above. We are conservative about sparse data, so we kept the sparse path the way it was. We may change it when we have concrete data proving the necessity of doing so.
[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18798 @viirya Sure! Comment updated.
[GitHub] spark issue #18798: [SPARK-19634][ML] Multivariate summarizer - dataframes A...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18798 Sorry, can we make the performance data clearer? It currently doesn't say what the unit of the numbers is.
[GitHub] spark issue #18944: [SPARK-21732][SQL]Lazily init hive metastore client
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18944 **[Test build #80655 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80655/testReport)** for PR 18944 at commit [`9eb9149`](https://github.com/apache/spark/commit/9eb91493aef7d4ce0e0f4cccd5597a4815d6fd58).
[GitHub] spark issue #18944: [SPARK-21732][SQL]Lazily init hive metastore client
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/18944 retest this please
[GitHub] spark pull request #18930: [SPARK-21677][SQL] json_tuple throws NullPointExc...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18930#discussion_r133112441
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2034,4 +2034,13 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
     }
   }
 }
+
+  test("SPARK-21677: json_tuple throws NullPointException when column is null as string type") {
+    checkAnswer(sql(
+      """
+        |SELECT json_tuple('{"a" : 1, "b" : 2}'
+        |, cast(NULL AS STRING), 'b'
+        |, cast(NULL AS STRING), 'a')
+      """.stripMargin), Row(null, "2", null, "1"))
--- End diff --
@jmchung Can we also add the test we discussed in Slack, which mixes a constant field name and a non-constant one?
[GitHub] spark issue #18931: [SPARK-21717][SQL][WIP] Decouple consume functions of ph...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18931 **[Test build #80654 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80654/testReport)** for PR 18931 at commit [`c04da15`](https://github.com/apache/spark/commit/c04da158a4221104717944f72be2c67ae23fdb63).
[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Merged build finished. Test PASSed.
[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18902 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80653/ Test PASSed.
[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80653 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80653/testReport)** for PR 18902 at commit [`8283411`](https://github.com/apache/spark/commit/82834117d7e587e335a599f8d5153e751b524862).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18905: [SPARK-21660][YARN][Shuffle] Yarn ShuffleService failed ...
Github user LiShuMing commented on the issue: https://github.com/apache/spark/pull/18905 @jerryshao Thanks for your replies! I will do the following:
1. On "is it good to change to other directories (is YARN internally relying on it)?": I think the recovery path (a local variable) is only used in `YarnShuffleService`, so in principle it does not affect the YARN environment. This PR addresses the case where there are many disks to choose from, so that we can pick a usable disk for the recovery path.
2. Check HDFS/YARN's disk health-check mechanism to better define `checkFileAvailable()`.
3. Fix the code format.
4. Throw an exception if `_recoveryPath` is still empty at the end.
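To make item 2 concrete, the availability check being proposed could look something like the sketch below. `checkFileAvailable()` is not yet defined in the PR, so both helpers here are hypothetical and merely illustrate the disk-probing idea; a real implementation would likely follow the semantics of YARN's `DiskChecker`:
```scala
import java.io.File

// Simplified probe for a usable recovery directory: it must exist (or be
// creatable) and be readable, writable, and traversable. Hypothetical helper,
// not the actual YarnShuffleService code.
def checkFileAvailable(dir: File): Boolean =
  (dir.isDirectory || dir.mkdirs()) && dir.canRead && dir.canWrite && dir.canExecute

// Pick the first healthy candidate for the recovery path; callers would throw
// if none qualifies (item 4 above).
def chooseRecoveryDir(candidates: Seq[File]): Option[File] =
  candidates.find(checkFileAvailable)
```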
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/17014 ping @MLnick ?
[GitHub] spark issue #18946: [SPARK-19471][SQL]AggregationIterator does not initializ...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18946 @gatorsmile
[GitHub] spark issue #18944: [SPARK-21732][SQL]Lazily init hive metastore client
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18944 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80651/ Test FAILed.
[GitHub] spark issue #18944: [SPARK-21732][SQL]Lazily init hive metastore client
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18944 Merged build finished. Test FAILed.
[GitHub] spark issue #18944: [SPARK-21732][SQL]Lazily init hive metastore client
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18944 **[Test build #80651 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80651/testReport)** for PR 18944 at commit [`9eb9149`](https://github.com/apache/spark/commit/9eb91493aef7d4ce0e0f4cccd5597a4815d6fd58).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18946: [SPARK-19471][SQL]AggregationIterator does not initializ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18946 Can one of the admins verify this patch?
[GitHub] spark pull request #18946: [SPARK-19471][SQL]AggregationIterator does not in...
GitHub user DonnyZone opened a pull request: https://github.com/apache/spark/pull/18946 [SPARK-19471][SQL] AggregationIterator does not initialize the generated result projection before using it

## What changes were proposed in this pull request?
This is a follow-up PR that moves the test case added in PR 18920 (https://github.com/apache/spark/pull/18920) to DataFrameAggregateSuite.

## How was this patch tested?
Unit test.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/DonnyZone/spark branch-19471-followingPR
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18946.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #18946

commit 9434fc767e6b7907b9beacb2f4358d767c9d4d32
Author: donnyzone
Date: 2017-08-15T02:29:25Z
move test case to DataFrameAggregateSuite

commit f057ff8400076fce615fe9b6521ed1b3d66cb669
Author: donnyzone
Date: 2017-08-15T02:37:33Z
change import order
[GitHub] spark issue #18902: [SPARK-21690][ML] one-pass imputer
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18902 **[Test build #80653 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80653/testReport)** for PR 18902 at commit [`8283411`](https://github.com/apache/spark/commit/82834117d7e587e335a599f8d5153e751b524862).
[GitHub] spark pull request #18887: [SPARK-20642][core] Store FsHistoryProvider listi...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/18887#discussion_r133102376

--- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -35,11 +38,14 @@ import org.apache.hadoop.security.AccessControlException

 import org.apache.spark.{SecurityManager, SparkConf, SparkException}
 import org.apache.spark.deploy.SparkHadoopUtil
+import org.apache.spark.deploy.history.config._
 import org.apache.spark.internal.Logging
 import org.apache.spark.scheduler._
 import org.apache.spark.scheduler.ReplayListenerBus._
+import org.apache.spark.status.api.v1
 import org.apache.spark.ui.SparkUI
 import org.apache.spark.util.{Clock, SystemClock, ThreadUtils, Utils}
+import org.apache.spark.util.kvstore._

 /**
  * A class that provides application history from event logs stored in the file system.
--- End diff --

I see the comment here refers to the removed data structure (though it is folded here); would you please fix the comment?
[GitHub] spark pull request #18887: [SPARK-20642][core] Store FsHistoryProvider listi...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/18887#discussion_r133102774

--- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -117,17 +123,38 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
   // used for logging msgs (logs are re-scanned based on file size, rather than modtime)
   private val lastScanTime = new java.util.concurrent.atomic.AtomicLong(-1)

-  // Mapping of application IDs to their metadata, in descending end time order. Apps are inserted
-  // into the map in order, so the LinkedHashMap maintains the correct ordering.
-  @volatile private var applications: mutable.LinkedHashMap[String, FsApplicationHistoryInfo]
-    = new mutable.LinkedHashMap()
+  private val pendingReplayTasksCount = new java.util.concurrent.atomic.AtomicInteger(0)

-  val fileToAppInfo = new ConcurrentHashMap[Path, FsApplicationAttemptInfo]()
+  private val storePath = conf.get(LOCAL_STORE_DIR)

-  // List of application logs to be deleted by event log cleaner.
-  private var attemptsToClean = new mutable.ListBuffer[FsApplicationAttemptInfo]
+  private val listing = storePath.map { path =>
+    val dbPath = new File(path, "listing.ldb")

-  private val pendingReplayTasksCount = new java.util.concurrent.atomic.AtomicInteger(0)
+    def openDB(): LevelDB = new LevelDB(dbPath, new KVStoreScalaSerializer())
+
+    try {
+      val db = openDB()
+      val meta = db.getMetadata(classOf[KVStoreMetadata])
+
+      if (meta == null) {
+        db.setMetadata(new KVStoreMetadata(CURRENT_LISTING_VERSION, logDir.toString()))
+        db
+      } else if (meta.version != CURRENT_LISTING_VERSION ||
+          !logDir.toString().equals(meta.logDir)) {
--- End diff --

Minor: `logDir` here is a String; I think there's no need to call `toString` here and above.
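(Editor's note: a tiny self-contained illustration of the reviewer's point, using stub values rather than the real `FsHistoryProvider` fields. Once both sides are plain `String`s, Scala's `!=` already performs a value comparison, so the `toString()`/`equals` dance is redundant.)

```scala
// Stubs standing in for the real fields, just to show the comparison.
case class KVStoreMetadata(version: Long, logDir: String)

val CURRENT_LISTING_VERSION = 1L
val logDir = "hdfs://nn:8020/spark-events"
val meta = KVStoreMetadata(1L, "hdfs://nn:8020/spark-events")

// Equivalent to `!logDir.toString().equals(meta.logDir)`, but cleaner:
val stale = meta.version != CURRENT_LISTING_VERSION || logDir != meta.logDir
```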
[GitHub] spark pull request #18887: [SPARK-20642][core] Store FsHistoryProvider listi...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/18887#discussion_r133104552

--- Diff: docs/monitoring.md ---
@@ -220,6 +220,13 @@ The history server can be configured as follows:
     Number of threads that will be used by history server to process event logs.
+
+  spark.history.store.path
--- End diff --

Looking at the code, it seems we don't have a default path for the local history store; do we need to add a default value here in the doc?
```
val LOCAL_STORE_DIR = ConfigBuilder("spark.history.store.path")
  .stringConf
  .createOptional
```
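(Editor's note: if a default were wanted, one option would be `createWithDefault` in place of `createOptional`, as sketched below. The default path and `doc` text are invented placeholders, and whether a default is appropriate at all is exactly the reviewer's open question. This code would live inside Spark's internal config module, as the quoted snippet does.)

```scala
import org.apache.spark.internal.config.ConfigBuilder

// Hypothetical variant with a default value; "/tmp/spark-history-store"
// is an invented placeholder, not a path the PR proposes.
val LOCAL_STORE_DIR = ConfigBuilder("spark.history.store.path")
  .doc("Local directory where the history server caches application data.")
  .stringConf
  .createWithDefault("/tmp/spark-history-store")
```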
[GitHub] spark issue #18944: [SPARK-21732][SQL]Lazily init hive metastore client
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/18944 lgtm
[GitHub] spark issue #18934: [SPARK-21721][SQL] Clear FileSystem deleteOnExit cache w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18934 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80649/ Test PASSed.
[GitHub] spark issue #18934: [SPARK-21721][SQL] Clear FileSystem deleteOnExit cache w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18934 Merged build finished. Test PASSed.
[GitHub] spark issue #18934: [SPARK-21721][SQL] Clear FileSystem deleteOnExit cache w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18934 **[Test build #80649 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80649/testReport)** for PR 18934 at commit [`2b0fd74`](https://github.com/apache/spark/commit/2b0fd74e0eebaf77f41f2e6e53830a961ff71c1e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18903: [SPARK-21590][SS]Window start time should support negati...
Github user KevinZwx commented on the issue: https://github.com/apache/spark/pull/18903 I added some tests in `DataFrameTimeWindowingSuite`, please have a look @brkyvz
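(Editor's note: for context, SPARK-21590 is about allowing a negative `startTime` offset in the `window()` function. A hedged usage sketch of what such tests presumably exercise; the `events` DataFrame and its "timestamp" column are assumptions, not code from the PR.)

```scala
import org.apache.spark.sql.functions.{col, window}

// Daily tumbling windows shifted back by 8 hours, e.g. to align day
// boundaries with a UTC-8 timezone: the negative offset this PR enables.
val daily = events
  .groupBy(window(col("timestamp"), "1 day", "1 day", "-8 hours"))
  .count()
```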
[GitHub] spark issue #18945: Add option to convert nullable int columns to float colu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18945 Can one of the admins verify this patch?
[GitHub] spark pull request #18945: Add option to convert nullable int columns to flo...
GitHub user logannc opened a pull request: https://github.com/apache/spark/pull/18945 Add option to convert nullable int columns to float columns in toPandas to prevent needless Exceptions during routine use

Add the `strict=True` kwarg to DataFrame.toPandas to allow for a non-strict interpretation of the schema of a dataframe. This is currently limited to allowing a nullable int column to be interpreted as a float column (because that is the only way Pandas supports nullable int columns, and toPandas actually crashes without this). I consider this small change to be a massive quality-of-life improvement for DataFrames with lots of nullable int columns, which would otherwise need a litany of `df.withColumn(name, F.col(name).cast(DoubleType()))`, etc., just to view them easily or interact with them in-memory.

**Possible Objections**
* I foresee concerns with the name of the kwarg, for which I am open to suggestions.
* I also foresee possible objections to the potential for needless conversion of nullable int columns to floats when there are actually no null values. I would counter that this only occurs when strict=False, which is not the default, and can be avoided on a per-column basis by setting the `nullable` property of the schema to False.

**Alternatives**
* Rename the kwarg to be specific to the current change, e.g. `nullable_int_to_float` instead of `strict`, or some other, similar name.
* Fix Pandas to allow nullable int columns. (Very difficult, per Wes McKinney, due to lack of NumPy support: https://stackoverflow.com/questions/11548005/numpy-or-pandas-keeping-array-type-as-integer-while-having-a-nan-value)

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/logannc/spark nullable_int_pandas
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18945.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #18945

commit bceeefca77dd3414e4ec97ad3570043ec3ce3059
Author: Logan Collins
Date: 2017-08-15T01:30:08Z
Add option to convert nullable int columns to float columns in toPandas to prevent needless crashes.
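(Editor's note: the per-column cast workaround described above, expressed as a reusable helper. This is a sketch only: the PR targets PySpark's `toPandas`, but the Scala DataFrame API is shown for consistency with the rest of this thread's code, and `relaxNullableInts` is an invented name, not part of the PR.)

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, IntegerType}

// Cast every nullable integer column to double so that, once the data
// reaches Pandas, nulls can be represented as NaN instead of raising.
def relaxNullableInts(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, field) =>
    if (field.nullable && field.dataType == IntegerType) {
      acc.withColumn(field.name, col(field.name).cast(DoubleType))
    } else {
      acc
    }
  }
```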