[GitHub] spark issue #4066: [SPARK-4879] Use driver to coordinate Hadoop output commi...
Github user matrixlibing commented on the issue: https://github.com/apache/spark/pull/4066 SPARK-4879 also happens when using saveAsNewAPIHadoopFile. Why is saveAsNewAPIHadoopFile not supported?
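As context for the comment above, here is a minimal, hypothetical reproduction of the two code paths being compared: the old mapred-API save that SPARK-4879's driver-side commit coordination covers, and the new mapreduce-API save the commenter says hits the same problem. The output paths and the example RDD are made up for illustration.

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.{TextOutputFormat => NewTextOutputFormat}
import org.apache.spark.{SparkConf, SparkContext}

object CommitCoordinationRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("commit-coordination-repro"))
    val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 2))
      .map { case (k, v) => (new Text(k), new IntWritable(v)) }

    // Old Hadoop API (mapred): output commits are coordinated by the driver (SPARK-4879).
    pairs.saveAsHadoopFile[org.apache.hadoop.mapred.TextOutputFormat[Text, IntWritable]]("/tmp/old-api-output")

    // New Hadoop API (mapreduce): the comment reports the same commit problem on this path.
    pairs.saveAsNewAPIHadoopFile[NewTextOutputFormat[Text, IntWritable]]("/tmp/new-api-output")

    sc.stop()
  }
}
```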
[GitHub] spark issue #14725: [SPARK-17161] [PYSPARK][ML] Add PySpark-ML JavaWrapper c...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/14725 Maybe it could be cleared up a bit with a good docstring? Although if the result is too confusing to be used then it's probably not worth doing.
[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/15211#discussion_r94723792 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -0,0 +1,554 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.classification + +import scala.collection.mutable + +import breeze.linalg.{DenseVector => BDV} +import breeze.optimize.{CachedDiffFunction, DiffFunction, OWLQN => BreezeOWLQN} +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.internal.Logging +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.linalg._ +import org.apache.spark.ml.linalg.BLAS._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.linalg.VectorImplicits._ +import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{Dataset, Row} +import org.apache.spark.sql.functions.{col, lit} + +/** Params for linear SVM Classifier. */ +private[classification] trait LinearSVCParams extends ClassifierParams with HasRegParam + with HasMaxIter with HasFitIntercept with HasTol with HasStandardization with HasWeightCol + with HasThreshold with HasAggregationDepth { + --- End diff -- nit, the braces here can be removed
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16296 Merged build finished. Test FAILed.
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16296 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70904/ Test FAILed.
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16296 **[Test build #70904 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70904/testReport)** for PR 16296 at commit [`91d173d`](https://github.com/apache/spark/commit/91d173de5e6610ea0621053320208fa8c6597b40). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #12135: [SPARK-14352][SQL] approxQuantile should support multi c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12135 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70902/ Test PASSed.
[GitHub] spark issue #12135: [SPARK-14352][SQL] approxQuantile should support multi c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12135 Merged build finished. Test PASSed.
[GitHub] spark issue #12135: [SPARK-14352][SQL] approxQuantile should support multi c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12135 **[Test build #70902 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70902/testReport)** for PR 12135 at commit [`6517f21`](https://github.com/apache/spark/commit/6517f2186417fce1d0fcde1d5e8c561d686dcef0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16432: [SPARK-19021][YARN] Generailize HDFSCredentialProvider t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16432 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70910/ Test PASSed.
[GitHub] spark issue #16432: [SPARK-19021][YARN] Generailize HDFSCredentialProvider t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16432 Merged build finished. Test PASSed.
[GitHub] spark issue #16432: [SPARK-19021][YARN] Generailize HDFSCredentialProvider t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16432 **[Test build #70910 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70910/testReport)** for PR 16432 at commit [`131d420`](https://github.com/apache/spark/commit/131d420c779702f8a7129f29a4d1d361eb92a166). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16474 Merged build finished. Test PASSed.
[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16474 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70903/ Test PASSed.
[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16474 **[Test build #70903 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70903/testReport)** for PR 16474 at commit [`586b347`](https://github.com/apache/spark/commit/586b347b04b64ddf2b70e4fb16035f80ad5a400e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16296 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70905/ Test FAILed.
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16296 Merged build finished. Test FAILed.
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16296 **[Test build #70905 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70905/testReport)** for PR 16296 at commit [`83ecc24`](https://github.com/apache/spark/commit/83ecc2439fcc7314cfaf67cfc8c18c99abb16f31). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r94721880 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala --- @@ -172,6 +187,8 @@ private[r] object LDAWrapper extends MLReadable[LDAWrapper] { model, ldaModel.logLikelihood(preprocessedData), ldaModel.logPerplexity(preprocessedData), + trainingLogLikelihood, + logPrior, --- End diff -- In the first version, I got them from the model in the `LDAWrapper` class. However, when I read `logPrior` back, I found that the loaded `logPrior` was not the same as the value before the save. So I followed the pattern of logLikelihood and logPerplexity and saved them in the metadata.
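A minimal sketch of the approach described in the comment, assuming the wrapper persists its scalar summaries as JSON metadata. This is not the actual LDAWrapper writer; the field names follow the comment and the class path is only illustrative. Values computed at training time are written out so load() can return exactly what was saved instead of recomputing them from the reloaded model.

```scala
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods.{compact, render}

object LdaWrapperMetadata {
  // Serialize the training-time summaries so a round trip through save/load
  // returns the original values rather than re-deriving them from the model.
  def toJsonString(trainingLogLikelihood: Double, logPrior: Double,
                   logLikelihood: Double, logPerplexity: Double): String = {
    val json = ("class" -> "org.apache.spark.ml.r.LDAWrapper") ~
      ("trainingLogLikelihood" -> trainingLogLikelihood) ~
      ("logPrior" -> logPrior) ~
      ("logLikelihood" -> logLikelihood) ~
      ("logPerplexity" -> logPerplexity)
    compact(render(json))
  }
}
```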
[GitHub] spark pull request #16417: [SPARK-19014][SQL] support complex aggregate buff...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16417#discussion_r94721694 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala --- @@ -92,46 +89,53 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro s"$rowWriter.reset();" } -val writeFields = inputs.zip(inputTypes).zipWithIndex.map { - case ((input, dataType), index) => +val writeFields = inputs.zip(inputsTypeAndNullable).zipWithIndex.map { + case ((input, (dataType, nullable)), index) => val dt = dataType match { case udt: UserDefinedType[_] => udt.sqlType case other => other } val tmpCursor = ctx.freshName("tmpCursor") val setNull = dt match { - case t: DecimalType if t.precision > Decimal.MAX_LONG_DIGITS => -// Can't call setNullAt() for DecimalType with precision larger than 18. -s"$rowWriter.write($index, (Decimal) null, ${t.precision}, ${t.scale});" + case dt if UnsafeRow.isMutable(dt) && !UnsafeRow.isFixedLength(dt) => +def varLenDataSize(s: DataType): Int = s match { --- End diff -- Would it be better to put this `varLenDataSize` into `UnsafeRow`? Writing a constant like 16 here makes the code hard to trace for newcomers.
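For readers following the diff, a hedged sketch of the helper under discussion: map a mutable but variable-length type to the number of bytes its value occupies in an UnsafeRow's variable-length region, so the 16 is not written as a bare constant. The method name comes from the diff; its placement in `UnsafeRow` is only the reviewer's suggestion, and only the high-precision decimal case is shown here as an illustration.

```scala
import org.apache.spark.sql.types.{DataType, Decimal, DecimalType}

object VarLenDataSizes {
  // Bytes reserved in the variable-length region for a value of the given type.
  // A decimal with precision above Decimal.MAX_LONG_DIGITS (18) cannot live in
  // the fixed-length word; its slot keeps offset/size while 16 bytes are
  // reserved in the variable-length region, hence the constant below.
  def varLenDataSize(dt: DataType): Int = dt match {
    case d: DecimalType if d.precision > Decimal.MAX_LONG_DIGITS => 16
    case other =>
      throw new UnsupportedOperationException(s"$other is not a mutable variable-length type")
  }
}
```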
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15819 Weird... Not sure why the build failed. The build works in my local environment. cc @srowen @JoshRosen
[GitHub] spark pull request #16417: [SPARK-19014][SQL] support complex aggregate buff...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16417#discussion_r94721395 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala --- @@ -327,24 +359,28 @@ class CodegenContext { nullable: Boolean, isVectorized: Boolean = false): String = { if (nullable) { - // Can't call setNullAt on DecimalType, because we need to keep the offset - if (!isVectorized && dataType.isInstanceOf[DecimalType]) { + // Can't call setNullAt for mutable but non-primitive data type, e.g. DecimalType, StructType + // with fix-length fields, because we need to keep the offset and length. + val isMutableVarLenField = UnsafeRow.isMutable(dataType) && !UnsafeRow.isFixedLength(dataType) --- End diff -- Do you think it would be good to add an `isMutableVarLenField` method to `UnsafeRow` directly?
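A hedged sketch of the refactoring the reviewer is asking about: give the repeated predicate one named home instead of re-deriving it at each call site. The method name is the reviewer's proposal, not an existing `UnsafeRow` API; it is sketched below as a small Scala helper that delegates to the existing static checks.

```scala
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.sql.types.DataType

object UnsafeRowTypeInfo {
  // True for types that can be updated in place in an UnsafeRow but are not
  // fixed-length (e.g. high-precision decimals): setNullAt() must not be used
  // on them because the offset and length words have to be preserved.
  def isMutableVarLenField(dt: DataType): Boolean =
    UnsafeRow.isMutable(dt) && !UnsafeRow.isFixedLength(dt)
}
```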
[GitHub] spark issue #16347: [SPARK-18934][SQL] Writing to dynamic partitions does no...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16347 Maybe we should make DataFrameWriter.sortBy work here.
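For readers who want the API shape being referred to: `DataFrameWriter` already exposes `sortBy`, but it is only honored together with `bucketBy`; the suggestion is to also let it control the row order written into each dynamic partition. A hedged sketch of today's working combination, with made-up table and column names, and a comment marking the proposed extension:

```scala
import org.apache.spark.sql.SparkSession

object SortByExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sortby-example")
      .enableHiveSupport()
      .getOrCreate()

    val df = spark.range(0, 1000).selectExpr("id", "id % 10 AS part", "rand() AS v")

    // Today sortBy must be paired with bucketBy; the proposal in this thread is
    // to let sortBy alone drive the sort order within plain dynamic partitions
    // (partitionBy without bucketing).
    df.write
      .partitionBy("part")
      .bucketBy(4, "id")
      .sortBy("v")
      .format("orc")
      .saveAsTable("sorted_bucketed_table")

    spark.stop()
  }
}
```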
[GitHub] spark issue #16432: [SPARK-19021][YARN] Generailize HDFSCredentialProvider t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16432 **[Test build #70910 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70910/testReport)** for PR 16432 at commit [`131d420`](https://github.com/apache/spark/commit/131d420c779702f8a7129f29a4d1d361eb92a166).
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15819 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70908/ Test FAILed.
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15819 **[Test build #70908 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70908/consoleFull)** for PR 15819 at commit [`15da7a8`](https://github.com/apache/spark/commit/15da7a83c8599a0d6d28f5a0bce8a3132033867c). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16417: [SPARK-19014][SQL] support complex aggregate buffer in H...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16417 **[Test build #70909 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70909/testReport)** for PR 16417 at commit [`32e527d`](https://github.com/apache/spark/commit/32e527d902318c9e81e8586f592968ee08416acd).
[GitHub] spark pull request #16417: [SPARK-19014][SQL] support complex aggregate buff...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16417#discussion_r94719867 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java --- @@ -303,6 +331,15 @@ public void setDecimal(int ordinal, Decimal value, int precision) { } } + public void setFixedLengthStruct(int ordinal, UnsafeRow row) { --- End diff -- Do we need to call `setNotNullAt(ordinal)` here?
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15819 Merged build finished. Test FAILed.
[GitHub] spark issue #16417: [SPARK-19014][SQL] support complex aggregate buffer in H...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16417 retest this please.
[GitHub] spark issue #16417: [SPARK-19014][SQL] support complex aggregate buffer in H...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16417 Jenkins looks unstable.
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15819 **[Test build #70908 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70908/consoleFull)** for PR 15819 at commit [`15da7a8`](https://github.com/apache/spark/commit/15da7a83c8599a0d6d28f5a0bce8a3132033867c).
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15819 retest this please
[GitHub] spark pull request #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15819#discussion_r94719375 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala --- @@ -216,5 +219,37 @@ class VersionsSuite extends SparkFunSuite with Logging { "as 'COMPACT' WITH DEFERRED REBUILD") client.reset() } + +test(s"$version: CREATE TABLE AS SELECT") { + withTable("tbl") { +sqlContext.sql("CREATE TABLE tbl AS SELECT 1 AS a") +assert(sqlContext.table("tbl").collect().toSeq == Seq(Row(1))) + } +} + +test(s"$version: Delete the temporary staging directory and files after each insert") { + withTempDir { tmpDir => +withTable("tab", "tbl") { + sqlContext.sql( +s""" + |CREATE TABLE tab(c1 string) + |location '${tmpDir.toURI.toString}' + """.stripMargin) + + import sqlContext.implicits._ --- End diff -- Nit: move this import to line 231.
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15819 Merged build finished. Test FAILed.
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15819 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70907/ Test FAILed.
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15819 **[Test build #70907 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70907/consoleFull)** for PR 15819 at commit [`15da7a8`](https://github.com/apache/spark/commit/15da7a83c8599a0d6d28f5a0bce8a3132033867c). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15819#discussion_r94719123 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala --- @@ -54,6 +63,63 @@ case class InsertIntoHiveTable( @transient private lazy val hiveContext = new Context(sc.hiveconf) @transient private lazy val catalog = sc.catalog + @transient var createdTempDir: Option[Path] = None + val stagingDir = new HiveConf().getVar(HiveConf.ConfVars.STAGINGDIR) + + private def executionId: String = { +val rand: Random = new Random +val format: SimpleDateFormat = new SimpleDateFormat("-MM-dd_HH-mm-ss_SSS") +val executionId: String = "hive_" + format.format(new Date) + "_" + Math.abs(rand.nextLong) + executionId --- End diff -- Nit: an indent issue. Please remove one more space.
[GitHub] spark pull request #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15819#discussion_r94719028 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala --- @@ -216,5 +219,37 @@ class VersionsSuite extends SparkFunSuite with Logging { "as 'COMPACT' WITH DEFERRED REBUILD") client.reset() } + +test(s"$version: CREATE TABLE AS SELECT") { + withTable("tbl") { +sqlContext.sql("CREATE TABLE tbl AS SELECT 1 AS a") +assert(sqlContext.table("tbl").collect().toSeq == Seq(Row(1))) + } +} + +test(s"$version: Delete the temporary staging directory and files after each insert") { + withTempDir { tmpDir => +withTable("tab", "tbl") { + sqlContext.sql( +s""" + |CREATE TABLE tab(c1 string) --- End diff -- Nit: two spaces -> one space
[GitHub] spark issue #16470: [SPARK-19033][Core] Add admin acls for history server
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16470 **[Test build #70906 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70906/testReport)** for PR 16470 at commit [`f4357e8`](https://github.com/apache/spark/commit/f4357e8ae890b0e0e021167ef796b7dd2f6cbb18).
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15819 **[Test build #70907 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70907/consoleFull)** for PR 15819 at commit [`15da7a8`](https://github.com/apache/spark/commit/15da7a83c8599a0d6d28f5a0bce8a3132033867c).
[GitHub] spark issue #15819: [SPARK-18372][SQL][Branch-1.6].Staging directory fail to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15819 retest this please
[GitHub] spark issue #16470: [SPARK-19033][Core] Add admin acls for history server
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/16470 Jenkins, retest this please.
[GitHub] spark issue #16337: [SPARK-18871][SQL] New test cases for IN/NOT IN subquery
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16337 I compared the results and confirmed they are consistent. LGTM
[GitHub] spark pull request #15415: [SPARK-14501][ML] spark.ml API for FPGrowth
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r94718428 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala --- @@ -0,0 +1,232 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.hadoop.fs.Path + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param.{DoubleParam, ParamMap, Params} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol} +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.fpm.{FPGrowth => MLlibFPGrowth, FPGrowthModel => MLlibFPGrowthModel} +import org.apache.spark.sql.{DataFrame, _} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.{ArrayType, StringType, StructType} + +/** + * Common params for FPGrowth and FPGrowthModel + */ +private[fpm] trait FPGrowthParams extends Params with HasFeaturesCol with HasPredictionCol { + + /** + * Validates and transforms the input schema. + * @param schema input schema + * @return output schema + */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +SchemaUtils.checkColumnType(schema, $(featuresCol), new ArrayType(StringType, false)) +SchemaUtils.appendColumn(schema, $(predictionCol), new ArrayType(StringType, false)) + } + + /** + * the minimal support level of the frequent pattern + * Default: 0.3 + * @group param + */ + @Since("2.2.0") + val minSupport: DoubleParam = new DoubleParam(this, "minSupport", +"the minimal support level of the frequent pattern (Default: 0.3)") + + /** @group getParam */ + @Since("2.2.0") + def getMinSupport: Double = $(minSupport) + +} + +/** + * :: Experimental :: + * A parallel FP-growth algorithm to mine frequent itemsets. 
+ * + * @see [[http://dx.doi.org/10.1145/1454008.1454027 Li et al., PFP: Parallel FP-Growth for Query + * Recommendation]] + */ +@Since("2.2.0") +@Experimental +class FPGrowth @Since("2.2.0") ( +@Since("2.2.0") override val uid: String) + extends Estimator[FPGrowthModel] with FPGrowthParams with DefaultParamsWritable { + + @Since("2.2.0") + def this() = this(Identifiable.randomUID("FPGrowth")) + + /** @group setParam */ + @Since("2.2.0") + def setMinSupport(value: Double): this.type = set(minSupport, value) + setDefault(minSupport -> 0.3) + + /** @group setParam */ + @Since("2.2.0") + def setFeaturesCol(value: String): this.type = set(featuresCol, value) + + /** @group setParam */ + @Since("2.2.0") + def setPredictionCol(value: String): this.type = set(predictionCol, value) + + def fit(dataset: Dataset[_]): FPGrowthModel = { +val data = dataset.select($(featuresCol)).rdd.map(r => r.getSeq[String](0).toArray) +val parentModel = new MLlibFPGrowth().setMinSupport($(minSupport)).run(data) +copyValues(new FPGrowthModel(uid, parentModel)) + } + + @Since("2.2.0") + override def transformSchema(schema: StructType): StructType = { +validateAndTransformSchema(schema) + } + + override def copy(extra: ParamMap): FPGrowth = defaultCopy(extra) +} + + +@Since("2.2.0") +object FPGrowth extends DefaultParamsReadable[FPGrowth] { + + @Since("2.2.0") + override def load(path: String): FPGrowth = super.load(path) +} + +/** + * :: Experimental :: + * Model fitted by FPGrowth. + * + * @param parentModel a model trained by spark.mllib.fpm.FPGrowth + */ +@Since("2.2.0") +@Experimental +class FPGrowthModel private[ml] ( +@Since("2.2.0") override val uid: String, +private val parentModel: MLlibFPGrowthModel[_]) + extends Model[FPGrowthModel] with FPGrowthParams
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16451 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70899/ Test PASSed.
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16451 Merged build finished. Test PASSed.
[GitHub] spark pull request #16460: [SPARK-19058][SQL] fix partition related behavior...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16460
[GitHub] spark issue #16460: [SPARK-19058][SQL] fix partition related behaviors with ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16460 thanks for the review, merging to master!
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16451 **[Test build #70899 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70899/testReport)** for PR 16451 at commit [`79a17f4`](https://github.com/apache/spark/commit/79a17f4ef67d8f2fcb6bb7e415cd9c15226a1773). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16337: [SPARK-18871][SQL] New test cases for IN/NOT IN s...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16337#discussion_r94718082 --- Diff: sql/core/src/test/resources/sql-tests/results/subquery/in-subquery/in-group-by.sql.out --- @@ -0,0 +1,357 @@ +-- Automatically generated by SQLQueryTestSuite +-- Number of queries: 19 + + +-- !query 0 +create temporary view t1 as select * from values + ("t1a", 6S, 8, 10L, float(15.0), 20D, 20E2, timestamp '2014-04-04 01:00:00.000', date '2014-04-04'), + ("t1b", 8S, 16, 19L, float(17.0), 25D, 26E2, timestamp '2014-05-04 01:01:00.000', date '2014-05-04'), + ("t1a", 16S, 12, 21L, float(15.0), 20D, 20E2, timestamp '2014-06-04 01:02:00.001', date '2014-06-04'), + ("t1a", 16S, 12, 10L, float(15.0), 20D, 20E2, timestamp '2014-07-04 01:01:00.000', date '2014-07-04'), + ("t1c", 8S, 16, 19L, float(17.0), 25D, 26E2, timestamp '2014-05-04 01:02:00.001', date '2014-05-05'), + ("t1d", null, 16, 22L, float(17.0), 25D, 26E2, timestamp '2014-06-04 01:01:00.000', null), + ("t1d", null, 16, 19L, float(17.0), 25D, 26E2, timestamp '2014-07-04 01:02:00.001', null), + ("t1e", 10S, null, 25L, float(17.0), 25D, 26E2, timestamp '2014-08-04 01:01:00.000', date '2014-08-04'), + ("t1e", 10S, null, 19L, float(17.0), 25D, 26E2, timestamp '2014-09-04 01:02:00.001', date '2014-09-04'), + ("t1d", 10S, null, 12L, float(17.0), 25D, 26E2, timestamp '2015-05-04 01:01:00.000', date '2015-05-04'), + ("t1a", 6S, 8, 10L, float(15.0), 20D, 20E2, timestamp '2014-04-04 01:02:00.001', date '2014-04-04'), + ("t1e", 10S, null, 19L, float(17.0), 25D, 26E2, timestamp '2014-05-04 01:01:00.000', date '2014-05-04') + as t1(t1a, t1b, t1c, t1d, t1e, t1f, t1g, t1h, t1i) +-- !query 0 schema +struct<> +-- !query 0 output + + + +-- !query 1 +create temporary view t2 as select * from values + ("t2a", 6S, 12, 14L, float(15), 20D, 20E2, timestamp '2014-04-04 01:01:00.000', date '2014-04-04'), + ("t1b", 10S, 12, 19L, float(17), 25D, 26E2, timestamp '2014-05-04 01:01:00.000', date '2014-05-04'), + ("t1b", 8S, 16, 119L, float(17), 25D, 26E2, timestamp '2015-05-04 01:01:00.000', date '2015-05-04'), + ("t1c", 12S, 16, 219L, float(17), 25D, 26E2, timestamp '2016-05-04 01:01:00.000', date '2016-05-04'), + ("t1b", null, 16, 319L, float(17), 25D, 26E2, timestamp '2017-05-04 01:01:00.000', null), + ("t2e", 8S, null, 419L, float(17), 25D, 26E2, timestamp '2014-06-04 01:01:00.000', date '2014-06-04'), + ("t1f", 19S, null, 519L, float(17), 25D, 26E2, timestamp '2014-05-04 01:01:00.000', date '2014-05-04'), + ("t1b", 10S, 12, 19L, float(17), 25D, 26E2, timestamp '2014-06-04 01:01:00.000', date '2014-06-04'), + ("t1b", 8S, 16, 19L, float(17), 25D, 26E2, timestamp '2014-07-04 01:01:00.000', date '2014-07-04'), + ("t1c", 12S, 16, 19L, float(17), 25D, 26E2, timestamp '2014-08-04 01:01:00.000', date '2014-08-05'), + ("t1e", 8S, null, 19L, float(17), 25D, 26E2, timestamp '2014-09-04 01:01:00.000', date '2014-09-04'), + ("t1f", 19S, null, 19L, float(17), 25D, 26E2, timestamp '2014-10-04 01:01:00.000', date '2014-10-04'), + ("t1b", null, 16, 19L, float(17), 25D, 26E2, timestamp '2014-05-04 01:01:00.000', null) + as t2(t2a, t2b, t2c, t2d, t2e, t2f, t2g, t2h, t2i) +-- !query 1 schema +struct<> +-- !query 1 output + + + +-- !query 2 +create temporary view t3 as select * from values + ("t3a", 6S, 12, 110L, float(15), 20D, 20E2, timestamp '2014-04-04 01:02:00.000', date '2014-04-04'), + ("t3a", 6S, 12, 10L, float(15), 20D, 20E2, timestamp '2014-05-04 01:02:00.000', date '2014-05-04'), + ("t1b", 10S, 12, 219L, float(17), 25D, 26E2, timestamp 
'2014-05-04 01:02:00.000', date '2014-05-04'), + ("t1b", 10S, 12, 19L, float(17), 25D, 26E2, timestamp '2014-05-04 01:02:00.000', date '2014-05-04'), + ("t1b", 8S, 16, 319L, float(17), 25D, 26E2, timestamp '2014-06-04 01:02:00.000', date '2014-06-04'), + ("t1b", 8S, 16, 19L, float(17), 25D, 26E2, timestamp '2014-07-04 01:02:00.000', date '2014-07-04'), + ("t3c", 17S, 16, 519L, float(17), 25D, 26E2, timestamp '2014-08-04 01:02:00.000', date '2014-08-04'), + ("t3c", 17S, 16, 19L, float(17), 25D, 26E2, timestamp '2014-09-04 01:02:00.000', date '2014-09-05'), + ("t1b", null, 16, 419L, float(17), 25D, 26E2, timestamp '2014-10-04 01:02:00.000', null), + ("t1b", null, 16, 19L, float(17), 25D, 26E2, timestamp '2014-11-04 01:02:00.000', null), + ("t3b", 8S, null, 719L, float(17), 25D, 26E2, timestamp '2014-05-04 01:02:00.000', date '2014-05-04'), + ("t3b", 8S, null, 19L, float(17), 25D, 26E2, timestamp '2015-05-04 01:02:00.000', date '2015-05-04') + as t3(t3a, t3b, t3c, t3d, t3e, t3f, t3g, t3h, t3i) +-- !query 2 schema +struct<> +-- !query 2 output + + + +-- !query
[GitHub] spark pull request #15415: [SPARK-14501][ML] spark.ml API for FPGrowth
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r94717710 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala --- @@ -0,0 +1,232 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.hadoop.fs.Path + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param.{DoubleParam, ParamMap, Params} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol} +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.fpm.{FPGrowth => MLlibFPGrowth, FPGrowthModel => MLlibFPGrowthModel} +import org.apache.spark.sql.{DataFrame, _} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.{ArrayType, StringType, StructType} + +/** + * Common params for FPGrowth and FPGrowthModel + */ +private[fpm] trait FPGrowthParams extends Params with HasFeaturesCol with HasPredictionCol { + + /** + * Validates and transforms the input schema. + * @param schema input schema + * @return output schema + */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +SchemaUtils.checkColumnType(schema, $(featuresCol), new ArrayType(StringType, false)) +SchemaUtils.appendColumn(schema, $(predictionCol), new ArrayType(StringType, false)) + } + + /** + * the minimal support level of the frequent pattern + * Default: 0.3 + * @group param + */ + @Since("2.2.0") + val minSupport: DoubleParam = new DoubleParam(this, "minSupport", +"the minimal support level of the frequent pattern (Default: 0.3)") + + /** @group getParam */ + @Since("2.2.0") + def getMinSupport: Double = $(minSupport) + +} + +/** + * :: Experimental :: + * A parallel FP-growth algorithm to mine frequent itemsets. 
+ * + * @see [[http://dx.doi.org/10.1145/1454008.1454027 Li et al., PFP: Parallel FP-Growth for Query + * Recommendation]] + */ +@Since("2.2.0") +@Experimental +class FPGrowth @Since("2.2.0") ( +@Since("2.2.0") override val uid: String) + extends Estimator[FPGrowthModel] with FPGrowthParams with DefaultParamsWritable { + + @Since("2.2.0") + def this() = this(Identifiable.randomUID("FPGrowth")) + + /** @group setParam */ + @Since("2.2.0") + def setMinSupport(value: Double): this.type = set(minSupport, value) + setDefault(minSupport -> 0.3) + + /** @group setParam */ + @Since("2.2.0") + def setFeaturesCol(value: String): this.type = set(featuresCol, value) + + /** @group setParam */ + @Since("2.2.0") + def setPredictionCol(value: String): this.type = set(predictionCol, value) + + def fit(dataset: Dataset[_]): FPGrowthModel = { +val data = dataset.select($(featuresCol)).rdd.map(r => r.getSeq[String](0).toArray) +val parentModel = new MLlibFPGrowth().setMinSupport($(minSupport)).run(data) +copyValues(new FPGrowthModel(uid, parentModel)) + } + + @Since("2.2.0") + override def transformSchema(schema: StructType): StructType = { +validateAndTransformSchema(schema) + } + + override def copy(extra: ParamMap): FPGrowth = defaultCopy(extra) +} + + +@Since("2.2.0") +object FPGrowth extends DefaultParamsReadable[FPGrowth] { + + @Since("2.2.0") + override def load(path: String): FPGrowth = super.load(path) +} + +/** + * :: Experimental :: + * Model fitted by FPGrowth. + * + * @param parentModel a model trained by spark.mllib.fpm.FPGrowth + */ +@Since("2.2.0") +@Experimental +class FPGrowthModel private[ml] ( +@Since("2.2.0") override val uid: String, +private val parentModel: MLlibFPGrowthModel[_]) + extends Model[FPGrowthModel] with FPGrowthParams
[GitHub] spark issue #16347: [SPARK-18934][SQL] Writing to dynamic partitions does no...
Github user junegunn commented on the issue: https://github.com/apache/spark/pull/16347 @chpritchard-expedia The patch here fixes the problem. I don't think it's possible to work around the issue by using the Spark API in some different way, because we can't completely avoid memory spills at the writers. Hive doesn't have the problem, so maybe you can consider running the same statement on Hive if this is not something Spark wants to address. Anyway, for anyone who's interested, I could confirm that for a sorted ORC table built with this patch, point/range lookups on the sorted column can be several times faster. Also, the final size of the table turned out to be significantly smaller in this case (60% of the unsorted table) due to the temporal locality in our data. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13252: [SPARK-15473][SQL] CSV data source writes header for emp...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/13252 Let me suggest a generalized way later, because this does not look like a clean fix. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13252: [SPARK-15473][SQL] CSV data source writes header ...
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/13252 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15415: [SPARK-14501][ML] spark.ml API for FPGrowth
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r94717542 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala --- @@ -0,0 +1,232 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.hadoop.fs.Path + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param.{DoubleParam, ParamMap, Params} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol} +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.fpm.{FPGrowth => MLlibFPGrowth, FPGrowthModel => MLlibFPGrowthModel} +import org.apache.spark.sql.{DataFrame, _} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.{ArrayType, StringType, StructType} + +/** + * Common params for FPGrowth and FPGrowthModel + */ +private[fpm] trait FPGrowthParams extends Params with HasFeaturesCol with HasPredictionCol { + + /** + * Validates and transforms the input schema. + * @param schema input schema + * @return output schema + */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +SchemaUtils.checkColumnType(schema, $(featuresCol), new ArrayType(StringType, false)) +SchemaUtils.appendColumn(schema, $(predictionCol), new ArrayType(StringType, false)) + } + + /** + * the minimal support level of the frequent pattern + * Default: 0.3 + * @group param + */ + @Since("2.2.0") + val minSupport: DoubleParam = new DoubleParam(this, "minSupport", +"the minimal support level of the frequent pattern (Default: 0.3)") + + /** @group getParam */ + @Since("2.2.0") + def getMinSupport: Double = $(minSupport) + +} + +/** + * :: Experimental :: + * A parallel FP-growth algorithm to mine frequent itemsets. + * + * @see [[http://dx.doi.org/10.1145/1454008.1454027 Li et al., PFP: Parallel FP-Growth for Query + * Recommendation]] + */ +@Since("2.2.0") +@Experimental +class FPGrowth @Since("2.2.0") ( +@Since("2.2.0") override val uid: String) + extends Estimator[FPGrowthModel] with FPGrowthParams with DefaultParamsWritable { + + @Since("2.2.0") + def this() = this(Identifiable.randomUID("FPGrowth")) + + /** @group setParam */ + @Since("2.2.0") + def setMinSupport(value: Double): this.type = set(minSupport, value) + setDefault(minSupport -> 0.3) + + /** @group setParam */ + @Since("2.2.0") + def setFeaturesCol(value: String): this.type = set(featuresCol, value) + + /** @group setParam */ + @Since("2.2.0") + def setPredictionCol(value: String): this.type = set(predictionCol, value) --- End diff -- ditto, `setOutputCol` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15415: [SPARK-14501][ML] spark.ml API for FPGrowth
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r94717473 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala --- @@ -0,0 +1,232 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.hadoop.fs.Path + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param.{DoubleParam, ParamMap, Params} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol} +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.fpm.{FPGrowth => MLlibFPGrowth, FPGrowthModel => MLlibFPGrowthModel} +import org.apache.spark.sql.{DataFrame, _} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.{ArrayType, StringType, StructType} + +/** + * Common params for FPGrowth and FPGrowthModel + */ +private[fpm] trait FPGrowthParams extends Params with HasFeaturesCol with HasPredictionCol { + + /** + * Validates and transforms the input schema. + * @param schema input schema + * @return output schema + */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +SchemaUtils.checkColumnType(schema, $(featuresCol), new ArrayType(StringType, false)) +SchemaUtils.appendColumn(schema, $(predictionCol), new ArrayType(StringType, false)) + } + + /** + * the minimal support level of the frequent pattern + * Default: 0.3 + * @group param + */ + @Since("2.2.0") + val minSupport: DoubleParam = new DoubleParam(this, "minSupport", +"the minimal support level of the frequent pattern (Default: 0.3)") + + /** @group getParam */ + @Since("2.2.0") + def getMinSupport: Double = $(minSupport) + +} + +/** + * :: Experimental :: + * A parallel FP-growth algorithm to mine frequent itemsets. + * + * @see [[http://dx.doi.org/10.1145/1454008.1454027 Li et al., PFP: Parallel FP-Growth for Query + * Recommendation]] + */ +@Since("2.2.0") +@Experimental +class FPGrowth @Since("2.2.0") ( +@Since("2.2.0") override val uid: String) + extends Estimator[FPGrowthModel] with FPGrowthParams with DefaultParamsWritable { + + @Since("2.2.0") + def this() = this(Identifiable.randomUID("FPGrowth")) + + /** @group setParam */ + @Since("2.2.0") + def setMinSupport(value: Double): this.type = set(minSupport, value) + setDefault(minSupport -> 0.3) + + /** @group setParam */ + @Since("2.2.0") + def setFeaturesCol(value: String): this.type = set(featuresCol, value) --- End diff -- I perfer use `setInputCol` and `inputCol` instead of this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
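To make the naming zhengruifeng prefers concrete, here is a minimal sketch that mixes in the shared `HasInputCol`/`HasOutputCol` params instead of `HasFeaturesCol`/`HasPredictionCol`. This is a hypothetical illustration of the suggestion, not the code under review; the `*Sketch` names are invented, and the sketch sits under `org.apache.spark.ml` only because the shared param traits are `private[ml]`.

```
// Hypothetical sketch: inputCol/outputCol naming for the FPGrowth params.
package org.apache.spark.ml.fpm

import org.apache.spark.ml.param.{ParamMap, Params}
import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
import org.apache.spark.ml.util.Identifiable

private[fpm] trait FPGrowthColumnsSketchParams extends Params with HasInputCol with HasOutputCol

class FPGrowthColumnsSketch(override val uid: String) extends FPGrowthColumnsSketchParams {

  def this() = this(Identifiable.randomUID("fpgrowth-sketch"))

  /** @group setParam */
  def setInputCol(value: String): this.type = set(inputCol, value)

  /** @group setParam */
  def setOutputCol(value: String): this.type = set(outputCol, value)

  override def copy(extra: ParamMap): FPGrowthColumnsSketch = defaultCopy(extra)
}
```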
[GitHub] spark issue #16470: [SPARK-19033][Core] Add admin acls for history server
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16470 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16470: [SPARK-19033][Core] Add admin acls for history server
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16470 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70900/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16470: [SPARK-19033][Core] Add admin acls for history server
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16470 **[Test build #70900 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70900/testReport)** for PR 16470 at commit [`f4357e8`](https://github.com/apache/spark/commit/f4357e8ae890b0e0e021167ef796b7dd2f6cbb18). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14284: [SPARK-16633] [SPARK-16642] [SPARK-16721] [SQL] Fixes th...
Github user chengat1314 commented on the issue: https://github.com/apache/spark/pull/14284 @hvanhovell Nice, thank you very much! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14725: [SPARK-17161] [PYSPARK][ML] Add PySpark-ML JavaWrapper c...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/14725 Thanks @holdenk for taking a look! Yeah, I think you're right about the issues trying to infer a type. It would be nice if there were some easy way to specify a primitive type, since that would cover most of the cases in mllib, but it's not that big of a deal. If anything, it's probably just a little obscure to a dev who's not familiar with py4j in Spark that, for a string, the java class should be `sc._gateway.jvm.java.lang.String`, for example. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15415: [SPARK-14501][ML] spark.ml API for FPGrowth
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r94717245 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala --- @@ -0,0 +1,232 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.hadoop.fs.Path + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param.{DoubleParam, ParamMap, Params} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol} +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.fpm.{FPGrowth => MLlibFPGrowth, FPGrowthModel => MLlibFPGrowthModel} +import org.apache.spark.sql.{DataFrame, _} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.{ArrayType, StringType, StructType} + +/** + * Common params for FPGrowth and FPGrowthModel + */ +private[fpm] trait FPGrowthParams extends Params with HasFeaturesCol with HasPredictionCol { + + /** + * Validates and transforms the input schema. + * @param schema input schema + * @return output schema + */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +SchemaUtils.checkColumnType(schema, $(featuresCol), new ArrayType(StringType, false)) +SchemaUtils.appendColumn(schema, $(predictionCol), new ArrayType(StringType, false)) + } + + /** + * the minimal support level of the frequent pattern + * Default: 0.3 + * @group param + */ + @Since("2.2.0") + val minSupport: DoubleParam = new DoubleParam(this, "minSupport", +"the minimal support level of the frequent pattern (Default: 0.3)") + + /** @group getParam */ + @Since("2.2.0") + def getMinSupport: Double = $(minSupport) + --- End diff -- MLLib's `FPGrowth` have a param `numPartitions`, will it be included here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
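For reference, if the reviewers decide to expose that knob, a spark.ml counterpart could look roughly like the sketch below. The trait name is invented, and whether the param should exist at all is exactly the open question, so treat this purely as an illustration.

```
// Hypothetical sketch: surfacing mllib's numPartitions as an expert Param.
package org.apache.spark.ml.fpm

import org.apache.spark.ml.param.{IntParam, Params, ParamValidators}

private[fpm] trait HasNumPartitionsSketch extends Params {

  /**
   * Number of partitions (at least 1) used by parallel FP-growth.
   * If unset, the partitioning of the input dataset is kept.
   * @group expertParam
   */
  val numPartitions: IntParam = new IntParam(this, "numPartitions",
    "number of partitions (at least 1) used by parallel FP-growth", ParamValidators.gtEq(1))

  /** @group expertGetParam */
  def getNumPartitions: Int = $(numPartitions)
}
```

In `fit`, the value would then be forwarded to the mllib estimator with `setNumPartitions`, mirroring how `minSupport` is already passed through.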
[GitHub] spark pull request #16435: [SPARK-19027][SQL] estimate size of object buffer...
Github user cloud-fan closed the pull request at: https://github.com/apache/spark/pull/16435 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16435: [SPARK-19027][SQL] estimate size of object buffer for ob...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16435 closing this as it's very hard to estimate the size and it does not provide much benefit for end users. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16460: [SPARK-19058][SQL] fix partition related behaviors with ...
Github user ericl commented on the issue: https://github.com/apache/spark/pull/16460 looks good --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15415: [SPARK-14501][ML] spark.ml API for FPGrowth
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r94717055 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala --- @@ -0,0 +1,232 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.hadoop.fs.Path + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param.{DoubleParam, ParamMap, Params} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol} +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.fpm.{FPGrowth => MLlibFPGrowth, FPGrowthModel => MLlibFPGrowthModel} +import org.apache.spark.sql.{DataFrame, _} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.{ArrayType, StringType, StructType} + +/** + * Common params for FPGrowth and FPGrowthModel + */ +private[fpm] trait FPGrowthParams extends Params with HasFeaturesCol with HasPredictionCol { + + /** + * Validates and transforms the input schema. + * @param schema input schema + * @return output schema + */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +SchemaUtils.checkColumnType(schema, $(featuresCol), new ArrayType(StringType, false)) +SchemaUtils.appendColumn(schema, $(predictionCol), new ArrayType(StringType, false)) + } + + /** + * the minimal support level of the frequent pattern + * Default: 0.3 + * @group param + */ + @Since("2.2.0") + val minSupport: DoubleParam = new DoubleParam(this, "minSupport", +"the minimal support level of the frequent pattern (Default: 0.3)") --- End diff -- also need a `ParamValidator` here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
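A minimal sketch of the validator being asked for, assuming the intended range for `minSupport` is [0, 1]; the trait name is made up for illustration only.

```
// Hypothetical sketch: minSupport declared with an explicit range validator.
package org.apache.spark.ml.fpm

import org.apache.spark.ml.param.{DoubleParam, Params, ParamValidators}

private[fpm] trait MinSupportSketch extends Params {

  /** Minimal support level of the frequent pattern, in range [0, 1]. Default: 0.3. */
  val minSupport: DoubleParam = new DoubleParam(this, "minSupport",
    "the minimal support level of the frequent pattern, in range [0, 1]",
    ParamValidators.inRange(0.0, 1.0))

  setDefault(minSupport -> 0.3)

  /** @group getParam */
  def getMinSupport: Double = $(minSupport)
}
```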
[GitHub] spark issue #15671: [SPARK-18206][ML]Add instrumentation for MLP,NB,LDA,AFT,...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15671 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15671: [SPARK-18206][ML]Add instrumentation for MLP,NB,LDA,AFT,...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15671 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70901/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15671: [SPARK-18206][ML]Add instrumentation for MLP,NB,LDA,AFT,...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15671 **[Test build #70901 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70901/testReport)** for PR 15671 at commit [`c8693d8`](https://github.com/apache/spark/commit/c8693d870373978b4a43b2215a1b487215107d45). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16464: [SPARK-19066][SparkR]:SparkR LDA doesn't set opti...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/16464#discussion_r94716683 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala --- @@ -172,6 +187,8 @@ private[r] object LDAWrapper extends MLReadable[LDAWrapper] { model, ldaModel.logLikelihood(preprocessedData), ldaModel.logPerplexity(preprocessedData), + trainingLogLikelihood, + logPrior, --- End diff -- since the model is referenced and persisted, is there a need to handle trainingLogLikelihood and logPrior separately like this and write them to metadata, instead of just getting them from the model when fetching the summary? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
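For context, the alternative felixcheung describes would look roughly like the sketch below: read the two quantities back from the fitted model when the summary is built, instead of writing them to metadata. This is a hypothetical sketch rather than the wrapper's actual code; note that only the EM-based `DistributedLDAModel` exposes these values, which may be why the PR stores them separately.

```
// Hypothetical sketch: recovering trainingLogLikelihood and logPrior from the model
// at summary time. Falls back to NaN for the online (non-distributed) model, which
// does not expose these quantities.
import org.apache.spark.ml.clustering.{DistributedLDAModel, LDAModel}

object LDASummarySketch {
  def trainingStats(ldaModel: LDAModel): (Double, Double) = ldaModel match {
    case m: DistributedLDAModel => (m.trainingLogLikelihood, m.logPrior)
    case _ => (Double.NaN, Double.NaN)
  }
}
```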
[GitHub] spark pull request #16460: [SPARK-19058][SQL] fix partition related behavior...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16460#discussion_r94716609 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala --- @@ -74,12 +69,30 @@ case class InsertIntoHadoopFsRelationCommand( val fs = outputPath.getFileSystem(hadoopConf) val qualifiedOutputPath = outputPath.makeQualified(fs.getUri, fs.getWorkingDirectory) +val partitionsTrackedByCatalog = sparkSession.sessionState.conf.manageFilesourcePartitions && + catalogTable.isDefined && + catalogTable.get.partitionColumnNames.nonEmpty && + catalogTable.get.tracksPartitionsInCatalog + +var initialMatchingPartitions: Seq[TablePartitionSpec] = Nil +var customPartitionLocations: Map[TablePartitionSpec, String] = Map.empty --- End diff -- yea it's true, but then the code may look ugly, e.g. ``` val (longVariableName: LongTypeNameXXX, longVariableName: LongTypeNameXXX) = { ... } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
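For readers following along, here is a small self-contained illustration of the trade-off under discussion, with short invented names standing in for the patch's longer ones: initializing two values with `var`s versus destructuring a tuple returned from a single block.

```
// Illustrative only; names and types are stand-ins for the patch's longer ones.
object TupleInitSketch {
  // Pretend this queries the catalog for (matching partitions, custom locations).
  def lookup(): (Seq[String], Map[String, String]) = (Seq("p=1"), Map("p=1" -> "/custom/loc"))

  // Style used in the patch: declare vars up front, assign inside the branch.
  def varStyle(tracked: Boolean): (Seq[String], Map[String, String]) = {
    var matchingPartitions: Seq[String] = Nil
    var customLocations: Map[String, String] = Map.empty
    if (tracked) {
      val (parts, locations) = lookup()
      matchingPartitions = parts
      customLocations = locations
    }
    (matchingPartitions, customLocations)
  }

  // Alternative cloud-fan finds harder to read once names and types get long.
  def valStyle(tracked: Boolean): (Seq[String], Map[String, String]) = {
    val (matchingPartitions, customLocations) =
      if (tracked) lookup() else (Nil, Map.empty[String, String])
    (matchingPartitions, customLocations)
  }
}
```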
[GitHub] spark issue #16460: [SPARK-19058][SQL] fix partition related behaviors with ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16460 cc @ericl any more comments on this PR? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15415: [SPARK-14501][ML] spark.ml API for FPGrowth
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r94716584 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/AssociationRules.scala --- @@ -0,0 +1,113 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.param.{DoubleParam, Param, ParamMap, Params} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.fpm.{AssociationRules => MLlibAssociationRules} +import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset +import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} + +/** + * :: Experimental :: + * + * Generates association rules from frequent itemsets ("items", "freq"). This method only generates + * association rules which have a single item as the consequent. + */ +@Since("2.1.0") --- End diff -- should be 2.2.0 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16468: [SPARK-19074][SS][DOCS] Updated Structured Streaming Pro...
Github user david-weiluo-ren commented on the issue: https://github.com/apache/spark/pull/16468 @tdas It says "However, note that all of the operations applicable on static DataFrames/Datasets are not supported in streaming DataFrames/Datasets yet" in https://spark.apache.org/docs/2.1.0/structured-streaming-programming-guide.html#unsupported-operations I think it should be "not all of the operations ... are supported in ... yet" instead of "all of the operations ... are not supported in ... yet" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16296 **[Test build #70905 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70905/testReport)** for PR 16296 at commit [`83ecc24`](https://github.com/apache/spark/commit/83ecc2439fcc7314cfaf67cfc8c18c99abb16f31). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16422: [SPARK-17642] [SQL] support DESC EXTENDED/FORMATTED tabl...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16422 Column-level security can block users from accessing specific columns, but this command `DESC EXTENDED/FORMATTED COLUMN` might not be part of that design/solution. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15415: [SPARK-14501][ML] spark.ml API for FPGrowth
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r94716372 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/AssociationRules.scala --- @@ -0,0 +1,113 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.param.{DoubleParam, Param, ParamMap, Params} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.fpm.{AssociationRules => MLlibAssociationRules} +import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset +import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} + +/** + * :: Experimental :: + * + * Generates association rules from frequent itemsets ("items", "freq"). This method only generates + * association rules which have a single item as the consequent. + */ +@Since("2.1.0") +@Experimental +class AssociationRules(override val uid: String) extends Params { + + @Since("2.1.0") + def this() = this(Identifiable.randomUID("AssociationRules")) + + /** + * Param for items column name. Items must be array of Integers. + * Default: "items" + * @group param + */ + final val itemsCol: Param[String] = new Param[String](this, "itemsCol", "items column name") + + + /** @group getParam */ + @Since("2.1.0") + final def getItemsCol: String = $(itemsCol) + + /** @group setParam */ + @Since("2.1.0") + def setItemsCol(value: String): this.type = set(itemsCol, value) + + /** + * Param for frequency column name. Data type should be Long. + * Default: "freq" + * @group param + */ + final val freqCol: Param[String] = new Param[String](this, "freqCol", "frequency column name") + + + /** @group getParam */ + @Since("2.1.0") + final def getFreqCol: String = $(freqCol) + + /** @group setParam */ + @Since("2.1.0") + def setFreqCol(value: String): this.type = set(freqCol, value) + + /** + * Param for minimum confidence, range [0.0, 1.0]. + * @group param + */ + final val minConfidence: DoubleParam = new DoubleParam(this, "minConfidence", "min confidence") + + /** @group getParam */ + @Since("2.1.0") + final def getMinConfidence: Double = $(minConfidence) + + /** @group setParam */ + @Since("2.1.0") + def setMinConfidence(value: Double): this.type = set(minConfidence, value) + + setDefault(itemsCol -> "items", freqCol -> "freq", minConfidence -> 0.8) + + /** + * Computes the association rules with confidence above [[minConfidence]]. + * @param freqItemsets DataFrame containing frequent itemset obtained from algorithms like + * [[FPGrowth]]. Users can set itemsCol (frequent itemSet, Array[String]) + * and freqCol (appearance count, Long) names in the DataFrame. 
+ * @return a DataFrame("antecedent", "consequent", "confidence") containing the association +* rules. + * + */ + @Since("2.1.0") + def run(freqItemsets: Dataset[_]): DataFrame = { --- End diff -- If inheriting `Transformer`, this should be `override def transform(dataset: Dataset[_]): DataFrame` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15415: [SPARK-14501][ML] spark.ml API for FPGrowth
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r94716281 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/AssociationRules.scala --- @@ -0,0 +1,113 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.param.{DoubleParam, Param, ParamMap, Params} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.fpm.{AssociationRules => MLlibAssociationRules} +import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset +import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} + +/** + * :: Experimental :: + * + * Generates association rules from frequent itemsets ("items", "freq"). This method only generates + * association rules which have a single item as the consequent. + */ +@Since("2.1.0") +@Experimental +class AssociationRules(override val uid: String) extends Params { --- End diff -- Since `AssociationRules` transform DataFrame `freqItemsets` to DataFrame `rules`, can it be a subclass of `Transformer`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
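To make the last two comments concrete, a `Transformer`-based version of `AssociationRules` could look roughly like the sketch below. It is a hypothetical sketch, not the PR's code: the column names and `minConfidence` are hard-coded, schema validation is elided, and the class name is invented.

```
// Hypothetical sketch: AssociationRules as a Transformer, with run() becoming transform().
package org.apache.spark.ml.fpm

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.mllib.fpm.{AssociationRules => MLlibAssociationRules}
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

class AssociationRulesSketch(override val uid: String) extends Transformer {

  def this() = this(Identifiable.randomUID("AssociationRulesSketch"))

  override def transform(freqItemsets: Dataset[_]): DataFrame = {
    val spark = freqItemsets.sparkSession
    import spark.implicits._
    // Convert the (items, freq) rows into mllib's FreqItemset and mine the rules.
    val itemsets = freqItemsets.select("items", "freq").rdd
      .map(row => new FreqItemset(row.getSeq[String](0).toArray, row.getLong(1)))
    new MLlibAssociationRules().setMinConfidence(0.8).run(itemsets)
      .map(rule => (rule.antecedent, rule.consequent, rule.confidence))
      .toDF("antecedent", "consequent", "confidence")
  }

  // A real implementation would validate the items/freq columns here.
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): AssociationRulesSketch = defaultCopy(extra)
}
```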
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16296 **[Test build #70904 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70904/testReport)** for PR 16296 at commit [`91d173d`](https://github.com/apache/spark/commit/91d173de5e6610ea0621053320208fa8c6597b40). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15415: [SPARK-14501][ML] spark.ml API for FPGrowth
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r94716047 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/AssociationRules.scala --- @@ -0,0 +1,113 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.param.{DoubleParam, Param, ParamMap, Params} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.fpm.{AssociationRules => MLlibAssociationRules} +import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset +import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} + +/** + * :: Experimental :: + * + * Generates association rules from frequent itemsets ("items", "freq"). This method only generates + * association rules which have a single item as the consequent. + */ +@Since("2.1.0") +@Experimental +class AssociationRules(override val uid: String) extends Params { + + @Since("2.1.0") + def this() = this(Identifiable.randomUID("AssociationRules")) + + /** + * Param for items column name. Items must be array of Integers. + * Default: "items" + * @group param + */ + final val itemsCol: Param[String] = new Param[String](this, "itemsCol", "items column name") + + + /** @group getParam */ + @Since("2.1.0") + final def getItemsCol: String = $(itemsCol) + + /** @group setParam */ + @Since("2.1.0") + def setItemsCol(value: String): this.type = set(itemsCol, value) + + /** + * Param for frequency column name. Data type should be Long. + * Default: "freq" + * @group param + */ + final val freqCol: Param[String] = new Param[String](this, "freqCol", "frequency column name") + + + /** @group getParam */ + @Since("2.1.0") + final def getFreqCol: String = $(freqCol) + + /** @group setParam */ + @Since("2.1.0") + def setFreqCol(value: String): this.type = set(freqCol, value) + + /** + * Param for minimum confidence, range [0.0, 1.0]. + * @group param + */ + final val minConfidence: DoubleParam = new DoubleParam(this, "minConfidence", "min confidence") + + /** @group getParam */ + @Since("2.1.0") + final def getMinConfidence: Double = $(minConfidence) + + /** @group setParam */ + @Since("2.1.0") + def setMinConfidence(value: Double): this.type = set(minConfidence, value) + + setDefault(itemsCol -> "items", freqCol -> "freq", minConfidence -> 0.8) + + /** + * Computes the association rules with confidence above [[minConfidence]]. + * @param freqItemsets DataFrame containing frequent itemset obtained from algorithms like + * [[FPGrowth]]. Users can set itemsCol (frequent itemSet, Array[String]) + * and freqCol (appearance count, Long) names in the DataFrame. 
+ * @return a DataFrame("antecedent", "consequent", "confidence") containing the association +* rules. + * + */ + @Since("2.1.0") + def run(freqItemsets: Dataset[_]): DataFrame = { +val freqItemSetRdd = freqItemsets.select($(itemsCol), $(freqCol)).rdd + .map(row => new FreqItemset(row.getSeq[String](0).toArray, row.getLong(1))) + +val sqlContext = SparkSession.builder().getOrCreate() --- End diff -- Since val `sqlContext` is of type `SparkSession`, what about renaming it `spark`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16472: [SPARK-18877][SQL][BACKPORT-2.0] `CSVInferSchema.inferFi...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16472 Thanks! Merged to Spark 2.0. Could you please close it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15415: [SPARK-14501][ML] spark.ml API for FPGrowth
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r94715820 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/AssociationRules.scala --- @@ -0,0 +1,113 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.param.{DoubleParam, Param, ParamMap, Params} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.fpm.{AssociationRules => MLlibAssociationRules} +import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset +import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} + +/** + * :: Experimental :: + * + * Generates association rules from frequent itemsets ("items", "freq"). This method only generates + * association rules which have a single item as the consequent. + */ +@Since("2.1.0") +@Experimental +class AssociationRules(override val uid: String) extends Params { + + @Since("2.1.0") + def this() = this(Identifiable.randomUID("AssociationRules")) + + /** + * Param for items column name. Items must be array of Integers. + * Default: "items" + * @group param + */ + final val itemsCol: Param[String] = new Param[String](this, "itemsCol", "items column name") + + + /** @group getParam */ + @Since("2.1.0") + final def getItemsCol: String = $(itemsCol) + + /** @group setParam */ + @Since("2.1.0") + def setItemsCol(value: String): this.type = set(itemsCol, value) + + /** + * Param for frequency column name. Data type should be Long. + * Default: "freq" + * @group param + */ + final val freqCol: Param[String] = new Param[String](this, "freqCol", "frequency column name") + + + /** @group getParam */ + @Since("2.1.0") + final def getFreqCol: String = $(freqCol) + + /** @group setParam */ + @Since("2.1.0") + def setFreqCol(value: String): this.type = set(freqCol, value) + + /** + * Param for minimum confidence, range [0.0, 1.0]. + * @group param + */ + final val minConfidence: DoubleParam = new DoubleParam(this, "minConfidence", "min confidence") + + /** @group getParam */ + @Since("2.1.0") + final def getMinConfidence: Double = $(minConfidence) + + /** @group setParam */ + @Since("2.1.0") + def setMinConfidence(value: Double): this.type = set(minConfidence, value) + + setDefault(itemsCol -> "items", freqCol -> "freq", minConfidence -> 0.8) + + /** + * Computes the association rules with confidence above [[minConfidence]]. + * @param freqItemsets DataFrame containing frequent itemset obtained from algorithms like + * [[FPGrowth]]. 
Users can set itemsCol (frequent itemSet, Array[String]) --- End diff -- `Array[String]` conflicts with `Array[Int]` in https://github.com/apache/spark/pull/15415/files#diff-0a641720038f962d333ef38402a02207R41. Is there some way to support general types? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
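One possible direction for the "general types" question is sketched below: keep the items column untyped when converting to mllib's generic `FreqItemset[Item]`. This only illustrates the idea and is not what the PR does; it also gives up compile-time typing of the items.

```
// Hypothetical sketch: converting an items/freq DataFrame without fixing the item
// type to String or Int. Column names are passed in rather than read from Params.
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset

object FreqItemsetConversionSketch {
  def toFreqItemsets(
      freqItemsets: Dataset[_],
      itemsCol: String,
      freqCol: String): RDD[FreqItemset[Any]] = {
    freqItemsets.select(itemsCol, freqCol).rdd
      .map(row => new FreqItemset[Any](row.getSeq[Any](0).toArray, row.getLong(1)))
  }
}
```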
[GitHub] spark issue #16451: [SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all ident...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16451 I just checked that each test except the one below passes on Windows via the concatenated [full-log](https://gist.github.com/HyukjinKwon/2d199ac9156c380015ad5a71f77866be). It seems `DirectKafkaStreamSuite` is flaky due to the occasional failure to remove the created temp directory (the same problem as https://github.com/apache/spark/pull/16451#discussion_r94562756, but that one fails constantly). It sometimes passes [![PR-16451](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=A7615F8B-58B0-4D9B-A914-32E7BF7DCB65=true)](https://ci.appveyor.com/project/spark-test/spark/branch/A7615F8B-58B0-4D9B-A914-32E7BF7DCB65) but sometimes fails as below: ``` DirectKafkaStreamSuite: Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite *** ABORTED *** (59 seconds, 626 milliseconds) java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-426107da-68cf-4d94-b0d6-1f428f1c53f6 ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
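Not part of this PR, but as background on the failure mode: a common mitigation for this kind of Windows flakiness is to retry the recursive delete a few times, since test brokers and consumers can keep file handles open briefly after shutdown. A rough, hypothetical sketch (not Spark's actual cleanup code):

```
// Illustrative sketch only: retry a recursive delete with a short back-off.
import java.io.File

object DeleteRetrySketch {
  private def deleteRecursively(f: File): Boolean = {
    if (f.isDirectory) {
      Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    }
    f.delete() || !f.exists()
  }

  def deleteWithRetries(dir: File, retries: Int = 3): Boolean = {
    if (deleteRecursively(dir)) {
      true
    } else if (retries > 0) {
      Thread.sleep(200)  // give the OS a moment to release file handles
      deleteWithRetries(dir, retries - 1)
    } else {
      false
    }
  }
}
```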
[GitHub] spark pull request #15415: [SPARK-14501][ML] spark.ml API for FPGrowth
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r94715503 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/AssociationRules.scala --- @@ -0,0 +1,113 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.param.{DoubleParam, Param, ParamMap, Params} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.fpm.{AssociationRules => MLlibAssociationRules} +import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset +import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} + +/** + * :: Experimental :: + * + * Generates association rules from frequent itemsets ("items", "freq"). This method only generates + * association rules which have a single item as the consequent. + */ +@Since("2.1.0") +@Experimental +class AssociationRules(override val uid: String) extends Params { + + @Since("2.1.0") + def this() = this(Identifiable.randomUID("AssociationRules")) + + /** + * Param for items column name. Items must be array of Integers. + * Default: "items" + * @group param + */ + final val itemsCol: Param[String] = new Param[String](this, "itemsCol", "items column name") + + + /** @group getParam */ + @Since("2.1.0") + final def getItemsCol: String = $(itemsCol) + + /** @group setParam */ + @Since("2.1.0") + def setItemsCol(value: String): this.type = set(itemsCol, value) + + /** + * Param for frequency column name. Data type should be Long. + * Default: "freq" + * @group param + */ + final val freqCol: Param[String] = new Param[String](this, "freqCol", "frequency column name") + + + /** @group getParam */ + @Since("2.1.0") + final def getFreqCol: String = $(freqCol) + + /** @group setParam */ + @Since("2.1.0") + def setFreqCol(value: String): this.type = set(freqCol, value) + + /** + * Param for minimum confidence, range [0.0, 1.0]. + * @group param + */ + final val minConfidence: DoubleParam = new DoubleParam(this, "minConfidence", "min confidence") --- End diff -- there should be a `ParamValidators.inRange(...)` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16308: [SPARK-18936][SQL] Infrastructure for session local time...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16308 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70896/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16308: [SPARK-18936][SQL] Infrastructure for session local time...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16308 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16308: [SPARK-18936][SQL] Infrastructure for session local time...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16308 **[Test build #70896 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70896/testReport)** for PR 16308 at commit [`5b6dd4f`](https://github.com/apache/spark/commit/5b6dd4f227e823937433c876fd6efa8d6fb75a0e). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class IndexToString @Since(\"2.2.0\") (@Since(\"1.5.0\") override val uid: String)` * `abstract class Collect[T <: Growable[Any] with Iterable[Any]] extends TypedImperativeAggregate[T] ` * `abstract class AggregateFunction extends Expression ` * `case class Literal (value: Any, dataType: DataType) extends LeafExpression ` * `case class LimitPushDown(conf: CatalystConf) extends Rule[LogicalPlan] ` * `trait TypedAggregateExpression extends AggregateFunction ` * `case class SimpleTypedAggregateExpression(` * `case class ComplexTypedAggregateExpression(` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16401: [SPARK-18998] [SQL] Add a cbo conf to switch betw...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/16401#discussion_r94714976 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala --- @@ -95,6 +96,29 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging { } /** + * Returns the default statistics or statistics estimated by cbo based on configuration. + */ + final def planStats(conf: CatalystConf): Statistics = { +if (conf.cboEnabled) { + if (estimatedStats.isEmpty) { +estimatedStats = Some(cboStatistics(conf)) + } + estimatedStats.get +} else { + statistics +} + } + + /** + * Returns statistics estimated by cbo. If the plan doesn't override this, it returns the + * default statistics. + */ + protected def cboStatistics(conf: CatalystConf): Statistics = statistics + + /** A cache for the estimated statistics, such that it will only be computed once. */ + private var estimatedStats: Option[Statistics] = None --- End diff -- `estimatedStats` is a cache for `def cboStatistics()`, where stats are calculated using column stats. Because I want to pass conf into cboStatistics (some parameters may be needed during estimation in the future), I can't use a lazy val, so I cache the result explicitly in `estimatedStats`. Yes, the naming is ambiguous; can you suggest a better name? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
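The caching pattern being discussed is easier to see stripped of its surroundings: because the estimate depends on a `conf` argument, a `lazy val` cannot be used, so the result is memoized by hand in an `Option` field. A self-contained sketch of that pattern; the `Conf`, `Statistics`, and `PlanLike` names below are placeholders, not the actual Catalyst classes:

```scala
// Placeholder types standing in for CatalystConf / Statistics / LogicalPlan.
final case class Conf(cboEnabled: Boolean)
final case class Statistics(sizeInBytes: BigInt)

abstract class PlanLike {
  /** Cheap default estimate, always available. */
  def statistics: Statistics

  /** Conf-dependent estimate; subclasses may override with a cbo-based one. */
  protected def cboStatistics(conf: Conf): Statistics = statistics

  // A lazy val cannot take `conf` as an argument, so the cbo estimate is
  // cached explicitly and computed at most once.
  private var cachedCboStats: Option[Statistics] = None

  final def planStats(conf: Conf): Statistics = {
    if (conf.cboEnabled) {
      if (cachedCboStats.isEmpty) {
        cachedCboStats = Some(cboStatistics(conf))
      }
      cachedCboStats.get
    } else {
      statistics
    }
  }
}
```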
[GitHub] spark issue #12135: [SPARK-14352][SQL] approxQuantile should support multi c...
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/12135 @jkbradley Updated. Thanks for reviewing. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parq...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16474 **[Test build #70903 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70903/testReport)** for PR 16474 at commit [`586b347`](https://github.com/apache/spark/commit/586b347b04b64ddf2b70e4fb16035f80ad5a400e). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16474: [SPARK-19082][SQL] Make ignoreCorruptFiles work f...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/16474 [SPARK-19082][SQL] Make ignoreCorruptFiles work for Parquet ## What changes were proposed in this pull request? We have a config, `spark.sql.files.ignoreCorruptFiles`, which can be used to ignore corrupt files when reading files in SQL. Currently the `ignoreCorruptFiles` config has two issues and doesn't work for Parquet: 1. We only ignore corrupt files in `FileScanRDD`. In fact, we begin to read those files as early as inferring the data schema from them. For corrupt files we can't read the schema, so the program fails. A related issue is reported at http://apache-spark-developers-list.1001551.n3.nabble.com/Skip-Corrupted-Parquet-blocks-footer-tc20418.html 2. In `FileScanRDD`, we assume that we only begin to read the files when starting to consume the iterator. However, the files may be read before that, in which case the `ignoreCorruptFiles` config doesn't work either. This patch targets the Parquet datasource. If this direction is ok, we can address the same issue for other datasources like Orc. Two main changes in this patch: 1. Replace `ParquetFileReader.readAllFootersInParallel` by implementing the logic to read footers in a multi-threaded manner. We can't ignore corrupt files if we use `ParquetFileReader.readAllFootersInParallel`, so this patch implements similar logic in `readParquetFootersInParallel`. 2. In `FileScanRDD`, we also need to ignore corrupt files when we call `readFunction` to return the iterator. One thing to note: we read the schema from the Parquet file's footer, and the method that reads the footer, `ParquetFileReader.readFooter`, throws `RuntimeException` instead of `IOException` if it can't successfully read the footer. Please check out https://github.com/apache/parquet-mr/blob/df9d8e415436292ae33e1ca0b8da256640de9710/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L470. So this patch catches `RuntimeException`. One concern is that this might also swallow runtime exceptions caused by something other than corrupt files. ## How was this patch tested? Jenkins tests. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 fix-ignorecorrupted-parquet-files Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16474.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16474 commit 586b347b04b64ddf2b70e4fb16035f80ad5a400e Author: Liang-Chi Hsieh Date: 2017-01-05T04:02:13Z Make ignoreCorruptFiles work for Parquet. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
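The first change described above, skipping unreadable footers instead of failing, can be illustrated with a small sketch. This is not the patch itself: the helper name is made up, the real patch also parallelizes the footer reads, and the footer-reading call is passed in as a function so that no particular `ParquetFileReader` signature is assumed.

```scala
import org.apache.hadoop.fs.FileStatus

// Sketch only: read footers one by one and drop files whose footer cannot be
// read when ignoreCorruptFiles is enabled. Per the PR description,
// ParquetFileReader.readFooter signals an unreadable footer with a
// RuntimeException rather than an IOException, hence the catch clause below.
def readFootersSkippingCorrupt[F](
    files: Seq[FileStatus],
    ignoreCorruptFiles: Boolean)(
    readFooter: FileStatus => F): Seq[F] = {
  files.flatMap { file =>
    try {
      Some(readFooter(file))
    } catch {
      case _: RuntimeException if ignoreCorruptFiles => None
    }
  }
}
```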
[GitHub] spark issue #12135: [SPARK-14352][SQL] approxQuantile should support multi c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12135 **[Test build #70902 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70902/testReport)** for PR 12135 at commit [`6517f21`](https://github.com/apache/spark/commit/6517f2186417fce1d0fcde1d5e8c561d686dcef0). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16429: [SPARK-19019][PYTHON] Fix hijacked `collections.namedtup...
Github user azmras commented on the issue: https://github.com/apache/spark/pull/16429 @cxww107 Try updating the patched files in both of the following locations: /usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark and /usr/local/Cellar/apache-spark/2.1.0/libexec/python/lib/pyspark.zip. See if it works. Thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15671: [SPARK-18206][ML]Add instrumentation for MLP,NB,LDA,AFT,...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15671 **[Test build #70901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70901/testReport)** for PR 15671 at commit [`c8693d8`](https://github.com/apache/spark/commit/c8693d870373978b4a43b2215a1b487215107d45). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15671: [SPARK-18206][ML]Add instrumentation for MLP,NB,LDA,AFT,...
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/15671 @jkbradley Updated according to your comments, including adding `quantileProbabilities` and `docConcentration`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15671: [SPARK-18206][ML]Add instrumentation for MLP,NB,L...
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/15671#discussion_r94712844 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala --- @@ -905,7 +911,10 @@ class LDA @Since("1.6.0") ( case m: OldDistributedLDAModel => new DistributedLDAModel(uid, m.vocabSize, m, dataset.sparkSession, None) } -copyValues(newModel).setParent(this) + +val model = copyValues(newModel).setParent(this) --- End diff -- ok, I'll add it --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16347: [SPARK-18934][SQL] Writing to dynamic partitions does no...
Github user chpritchard-expedia commented on the issue: https://github.com/apache/spark/pull/16347 @junegunn I ran into the same issue using partitionBy; I missed it completely during my testing. Would you share the workaround you used? I wasn't able to understand it from your comment on the Apache JIRA. At present I'm thinking about using mapPartitions and writing each partition out from there. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
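For context, this is the kind of dynamic-partition write the thread is about. A generic sketch, not the workaround being requested; the paths and column names are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dynamic-partition-write").getOrCreate()

// Placeholder input; any DataFrame containing the partition columns will do.
val df = spark.read.parquet("/data/events")

// partitionBy produces one output directory per distinct (year, month) value,
// i.e. the dynamic partitions discussed in SPARK-18934.
df.write
  .mode("overwrite")
  .partitionBy("year", "month")
  .parquet("/data/events_partitioned")
```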
[GitHub] spark issue #12775: [SPARK-14958][Core] Failed task not handled when there's...
Github user lirui-intel commented on the issue: https://github.com/apache/spark/pull/12775 I think the failure is due to one more [skipped test](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70890/testReport/pyspark.sql.tests/HiveContextSQLTests/test_unbounded_frames/). The skip message is "Unittest < 3.3 doesn't support mocking". Any idea how that could be related to this PR? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double numeric d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15314 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org