[GitHub] spark pull request #14433: [SPARK-16829][SparkR]:sparkR sc.setLogLevel doesn...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14433#discussion_r74883390

--- Diff: core/src/main/scala/org/apache/spark/internal/Logging.scala ---
@@ -18,9 +18,9 @@ package org.apache.spark.internal

 import org.apache.log4j.{Level, LogManager, PropertyConfigurator}
+import org.apache.spark.deploy.{SparkShellType, SparkSubmit}
 import org.slf4j.{Logger, LoggerFactory}
 import org.slf4j.impl.StaticLoggerBinder
-
 import org.apache.spark.util.Utils
--- End diff --

As per [the Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports), `import org.apache.spark.deploy` should go here.
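The style guide groups imports as: java/javax, scala, third-party libraries, then `org.apache.spark`, with a blank line between groups. A minimal sketch of the ordering the reviewer is asking for, assuming the new `deploy` import belongs in the Spark group (surrounding imports taken from the diff):

```scala
import org.apache.log4j.{Level, LogManager, PropertyConfigurator}
import org.slf4j.{Logger, LoggerFactory}
import org.slf4j.impl.StaticLoggerBinder

// Spark's own packages form the last group, alphabetically ordered.
import org.apache.spark.deploy.{SparkShellType, SparkSubmit}
import org.apache.spark.util.Utils
```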
[GitHub] spark issue #14035: [SPARK-16356][ML] Add testImplicits for ML unit tests an...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14035

Hi @jkbradley, could you take a look at this one, please?
[GitHub] spark pull request #14182: [SPARK-16444][SparkR]: Isotonic Regression wrappe...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14182#discussion_r74883011

--- Diff: R/pkg/R/mllib.R ---
@@ -533,6 +630,25 @@ setMethod("write.ml", signature(object = "KMeansModel", path = "character"),
             invisible(callJMethod(writer, "save", path))
           })
+
+# Save fitted IsotonicRegressionModel to the input path
+
+#' @param path The directory where the model is saved
+#' @param overwrite Overwrites or not if the output path already exists. Default is FALSE
+#'                  which means throw exception if the output path exists.
+#'
+#' @rdname spark.isoreg
+#' @aliases write.ml,IsotonicRegressionModel-method
--- End diff --

`#' @aliases write.ml,IsotonicRegressionModel,character-method`
[GitHub] spark pull request #14656: [SPARK-17069] Expose spark.range() as table-value...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/14656#discussion_r74882826

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveTableValuedFunctions.scala ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.{SparkConf, SparkContext}
+import org.apache.spark.sql.catalyst.plans._
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Range}
+import org.apache.spark.sql.catalyst.rules._
+
+/**
+ * Rule that resolves table-valued function references.
+ */
+object ResolveTableValuedFunctions extends Rule[LogicalPlan] {
+  private lazy val defaultParallelism =
+    SparkContext.getOrCreate(new SparkConf(false)).defaultParallelism
+
+  /**
+   * List of argument names and their types, used to declare a function.
+   */
+  private case class ArgumentList(args: (String, Class[_])*) {
+    /**
+     * @return whether this list is assignable from the given sequence of values.
+     */
+    def assignableFrom(values: Seq[Any]): Boolean = {
+      if (args.length == values.length) {
+        args.zip(values).forall { case ((name, clazz), value) =>
+          clazz.isAssignableFrom(value.getClass)
+        }
+      } else {
+        false
+      }
+    }
+
+    override def toString: String = {
+      args.map { a =>
+        s"${a._1}: ${a._2.getSimpleName}"
+      }.mkString(", ")
+    }
+  }
+
+  /**
+   * A TVF maps argument lists to resolver functions that accept those arguments. Using a map
+   * here allows for function overloading.
+   */
+  private type TVF = Map[ArgumentList, Seq[Any] => LogicalPlan]
+
+  /**
+   * Internal registry of table-valued functions. TODO(ekl) we should have a proper registry
+   */
+  private val builtinFunctions: Map[String, TVF] = Map(
--- End diff --

So I think the catalyst way to do this is that the resolution rule only works if all the children of the expression are resolved, and then you have a CheckAnalysis rule that shows the error if there is an UnresolvedTableValuedFunction.
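A minimal sketch of the pattern being suggested — keep an unresolved placeholder in the plan and report the failure from a CheckAnalysis-style pass rather than inside the resolution rule. The node name and error message here are illustrative assumptions, not Spark's actual implementation:

```scala
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan}

// Hypothetical placeholder left behind when no overload of a TVF matches.
case class UnresolvedTableValuedFunction(name: String, args: Seq[Any]) extends LeafNode {
  override lazy val resolved: Boolean = false
  override def output: Seq[Attribute] = Nil
}

// In a CheckAnalysis-style pass, surface the error with a clear message.
def checkTableValuedFunctions(plan: LogicalPlan): Unit = plan.foreachUp {
  case u: UnresolvedTableValuedFunction =>
    throw new IllegalArgumentException(
      s"could not resolve table-valued function `${u.name}` with ${u.args.length} argument(s)")
  case _ =>
}
```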
[GitHub] spark pull request #14182: [SPARK-16444][SparkR]: Isotonic Regression wrappe...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14182#discussion_r74882707

--- Diff: R/pkg/R/mllib.R ---
@@ -299,6 +308,94 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
             return(list(apriori = apriori, tables = tables))
           })
+
+#' Isotonic Regression Model
+#'
+#' Fits an Isotonic Regression model against a Spark DataFrame, similarly to R's isoreg().
+#' Users can print, make predictions on the produced model and save the model to the input path.
+#'
+#' @param data SparkDataFrame for training
+#' @param formula A symbolic description of the model to be fitted. Currently only a few formula
+#'                operators are supported, including '~', '.', ':', '+', and '-'.
+#' @param isotonic Whether the output sequence should be isotonic/increasing (TRUE) or
+#'                 antitonic/decreasing (FALSE)
+#' @param featureIndex The index of the feature if \code{featuresCol} is a vector column (default: `0`),
+#'                     no effect otherwise
+#' @param weightCol The weight column name.
+#' @return \code{spark.isoreg} returns a fitted Isotonic Regression model
+#' @rdname spark.isoreg
+#' @aliases spark.isoreg,SparkDataFrame,formula-method
+#' @name spark.isoreg
+#' @export
+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' data <- list(list(7.0, 0.0), list(5.0, 1.0), list(3.0, 2.0),
+#'              list(5.0, 3.0), list(1.0, 4.0))
+#' df <- createDataFrame(data, c("label", "feature"))
+#' model <- spark.isoreg(df, label ~ feature, isotonic = FALSE)
+#' # return model boundaries and prediction as lists
+#' result <- summary(model, df)
+#' # prediction based on fitted model
+#' predict_data <- list(list(-2.0), list(-1.0), list(0.5),
+#'                      list(0.75), list(1.0), list(2.0), list(9.0))
+#' predict_df <- createDataFrame(predict_data, c("feature"))
+#' # get prediction column
+#' predict_result <- collect(select(predict(model, predict_df), "prediction"))
+#'
+#' # save fitted model to input path
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#'
+#' # can also read back the saved model and print
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.isoreg since 2.1.0
+setMethod("spark.isoreg", signature(data = "SparkDataFrame", formula = "formula"),
+          function(data, formula, isotonic = TRUE, featureIndex = 0, weightCol = NULL) {
+            formula <- paste0(deparse(formula), collapse = "")
+
+            if (is.null(weightCol)) {
+              weightCol <- ""
+            }
+
+            jobj <- callJStatic("org.apache.spark.ml.r.IsotonicRegressionWrapper", "fit",
+                                data@sdf, formula, as.logical(isotonic), as.integer(featureIndex),
+                                as.character(weightCol))
+            return(new("IsotonicRegressionModel", jobj = jobj))
+          })
+
+# Predicted values based on an isotonicRegression model
+
+#' @param object a fitted IsotonicRegressionModel
+#' @param newData SparkDataFrame for testing
+#' @return \code{predict} returns a SparkDataFrame containing predicted values
+#' @rdname spark.isoreg
+#' @aliases predict,SparkDataFrame,SparkDataFrame-method
--- End diff --

`#' @aliases predict,IsotonicRegressionModel,SparkDataFrame-method`
[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14392#discussion_r74882538

--- Diff: R/pkg/R/mllib.R ---
@@ -717,8 +717,9 @@ setMethod("spark.gaussianMixture", signature(data = "SparkDataFrame", formula =

 # Get the summary of a multivariate gaussian mixture model

-#' @param object A fitted gaussian mixture model
-#' @return \code{summary} returns the model's lambda, mu, sigma and posterior
+#' @param object a fitted gaussian mixture model.
+#' @param ... additional argument(s) passed to the method.
--- End diff --

I'd say "Currently not used" instead for this case.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/14392

Only one last comment, LGTM.
[GitHub] spark issue #14558: [SPARK-16508][SparkR] Fix warnings on undocumented/dupli...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14558

Merged build finished. Test PASSed.
[GitHub] spark issue #14558: [SPARK-16508][SparkR] Fix warnings on undocumented/dupli...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14558

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63830/
[GitHub] spark issue #14558: [SPARK-16508][SparkR] Fix warnings on undocumented/dupli...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14558

**[Test build #63830 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63830/consoleFull)** for PR 14558 at commit [`e5771a1`](https://github.com/apache/spark/commit/e5771a1d18366b2b4ace2ffba49bbe89312e0acd).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r74881945

--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java ---
@@ -25,55 +25,57 @@
 import org.apache.spark.sql.types.*;
 import org.apache.spark.unsafe.Platform;
 import org.apache.spark.unsafe.array.ByteArrayMethods;
+import org.apache.spark.unsafe.bitset.BitSetMethods;
 import org.apache.spark.unsafe.hash.Murmur3_x86_32;
 import org.apache.spark.unsafe.types.CalendarInterval;
 import org.apache.spark.unsafe.types.UTF8String;

 /**
  * An Unsafe implementation of Array which is backed by raw memory instead of Java objects.
  *
- * Each tuple has three parts: [numElements] [offsets] [values]
+ * Each array has four parts: [numElements][null bits][values or offset][variable length portion]
  *
- * The `numElements` is 4 bytes storing the number of elements of this array.
+ * The `numElements` is 8 bytes storing the number of elements of this array.
  *
- * In the `offsets` region, we store 4 bytes per element, represents the relative offset (w.r.t. the
- * base address of the array) of this element in `values` region. We can get the length of this
- * element by subtracting next offset.
- * Note that offset can by negative which means this element is null.
+ * In the `null bits` region, we store 1 bit per element, represents whether a element has null
+ * Its total size is ceil(numElements / 8) bytes, and it is aligned to 8-byte word boundaries.
  *
- * In the `values` region, we store the content of elements. As we can get length info, so elements
- * can be variable-length.
+ * In the `values or offset` region, we store the content of elements. For fields that hold
+ * fixed-length primitive types, such as long, double, or int, we store the value directly
+ * in the field. For fields with non-primitive or variable-length values, we store a relative
+ * offset (w.r.t. the base address of the array) that points to the beginning of
+ * the variable-length field into int. It can only be calculated by knowing the total bytes of
--- End diff --

i see, makes sense, thanks for the explanation!
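For reference on the new layout, the byte offsets of each region can be derived from the comment above — a minimal sketch assuming 8-byte words, as the comment describes (an illustrative helper, not Spark's actual code):

```scala
// Layout: [numElements: 8 bytes][null bits][values or offsets][variable length portion]
// Returns (start of null bits, start of values, start of variable-length region),
// in bytes relative to the array's base address.
def regionOffsets(numElements: Int, fixedElementSize: Int): (Long, Long, Long) = {
  val nullBitsStart = 8L // numElements occupies the first 8 bytes
  // 1 bit per element, rounded up to a multiple of 8-byte words
  val nullBitsBytes = ((numElements + 63) / 64) * 8L
  val valuesStart = nullBitsStart + nullBitsBytes
  val variableStart = valuesStart + numElements.toLong * fixedElementSize
  (nullBitsStart, valuesStart, variableStart)
}
```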
[GitHub] spark issue #14658: [WIP][SPARK-5928][SPARK-6238] Remote Shuffle Blocks cann...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14658

Merged build finished. Test PASSed.
[GitHub] spark issue #14658: [WIP][SPARK-5928][SPARK-6238] Remote Shuffle Blocks cann...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14658

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63823/
[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13758

No, by generating the unsafe array directly, we will have an additional copy. But I don't think that matters too much, or that it's worth such a big change. Anyway, let's get the new unsafe array in first, and we can come back to this PR later.
[GitHub] spark issue #14658: [WIP][SPARK-5928][SPARK-6238] Remote Shuffle Blocks cann...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14658

**[Test build #63823 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63823/consoleFull)** for PR 14658 at commit [`443aa91`](https://github.com/apache/spark/commit/443aa91cfc2490be9733c78b7cd911f09bedfac6).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `public class ChunkFetchInputStream extends InputStream`
[GitHub] spark pull request #14648: [SPARK-16995][SQL] TreeNodeException when flat ma...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14648#discussion_r74881208

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -727,6 +727,13 @@ object FoldablePropagation extends Rule[LogicalPlan] {
         case j @ Join(_, _, LeftOuter | RightOuter | FullOuter, _) =>
           stop = true
           j
+
+        // Operators that operate on objects should only have expressions from encoders, which
+        // should never have foldable expressions.
+        case o: ObjectConsumer => o
--- End diff --

OK, I will update once I'm back at my laptop.
[GitHub] spark pull request #14648: [SPARK-16995][SQL] TreeNodeException when flat ma...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14648#discussion_r74880666

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -727,6 +727,13 @@ object FoldablePropagation extends Rule[LogicalPlan] {
         case j @ Join(_, _, LeftOuter | RightOuter | FullOuter, _) =>
           stop = true
           j
+
+        // Operators that operate on objects should only have expressions from encoders, which
+        // should never have foldable expressions.
+        case o: ObjectConsumer => o
--- End diff --

We should follow the other cases and set `stop` to `true`.
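A minimal sketch of the suggested change, following the shape of the `Join` case quoted in the diff (the surrounding rule code is paraphrased from that context):

```scala
// Operators that operate on objects should only have expressions from encoders,
// which should never have foldable expressions — so stop propagation here,
// just like the other terminating cases do.
case o: ObjectConsumer =>
  stop = true
  o
```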
[GitHub] spark issue #14558: [SPARK-16508][SparkR] Fix warnings on undocumented/dupli...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14558

**[Test build #63830 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63830/consoleFull)** for PR 14558 at commit [`e5771a1`](https://github.com/apache/spark/commit/e5771a1d18366b2b4ace2ffba49bbe89312e0acd).
[GitHub] spark issue #14656: [SPARK-17069] Expose spark.range() as table-valued funct...
Github user ericl commented on the issue: https://github.com/apache/spark/pull/14656

Updated to use catalyst type coercion.
[GitHub] spark issue #14656: [SPARK-17069] Expose spark.range() as table-valued funct...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14656

**[Test build #63829 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63829/consoleFull)** for PR 14656 at commit [`2f80f54`](https://github.com/apache/spark/commit/2f80f549dd3d765fd3fdc63f9795d8e5562e38fa).
[GitHub] spark pull request #14601: [SPARK-13979][Core][WIP]Killed executor is re spa...
Github user agsachin commented on a diff in the pull request: https://github.com/apache/spark/pull/14601#discussion_r74880042

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
@@ -107,6 +107,14 @@ class SparkHadoopUtil extends Logging {
       if (key.startsWith("spark.hadoop.")) {
--- End diff --

done.
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user jodersky commented on the issue: https://github.com/apache/spark/pull/14359

> I switched to Stack and then realized Stack has been deprecated in Scala 2.11...

I think you probably read the *immutable* stack docs; the *mutable* stack is not deprecated AFAIK. I can imagine that having a custom stack implementation may allow for additional operations in the future; however, we should also consider that using standard collections reduces the load for anyone who will maintain the code later. Btw, I highly recommend using the [milestone scaladocs](http://www.scala-lang.org/api/2.12.0-M5/scala/collection/mutable/Stack.html) over the current ones. Although 2.12 is not officially out yet, the changes to the library are minimal and the UI is much more pleasant to use.
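For reference, `scala.collection.mutable.Stack` is the non-deprecated one in 2.11 (only `scala.collection.immutable.Stack` carries the deprecation) — a small usage example:

```scala
import scala.collection.mutable

val stack = mutable.Stack[Int]()
stack.push(1)
stack.push(2)
println(stack.pop()) // 2 — last in, first out
println(stack.top)   // 1 — peek without removing
```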
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580

Thank you for the reviews so far, @gatorsmile, @hvanhovell, @nsyca, @yhuai. I'm closing this PR. I'm looking forward to seeing @gatorsmile's PR and to getting a better master branch soon. :)
[GitHub] spark pull request #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optim...
Github user dongjoon-hyun closed the pull request at: https://github.com/apache/spark/pull/14580
[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/13796#discussion_r74879829

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala ---
@@ -0,0 +1,626 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, LBFGS => BreezeLBFGS, OWLQN => BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+import org.apache.spark.sql.types.DoubleType
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * Params for multinomial logistic regression.
+ */
+private[classification] trait MultinomialLogisticRegressionParams
+  extends ProbabilisticClassifierParams with HasRegParam with HasElasticNetParam with HasMaxIter
+    with HasFitIntercept with HasTol with HasStandardization with HasWeightCol {
+
+  /**
+   * Set thresholds in multiclass (or binary) classification to adjust the probability of
+   * predicting each class. Array must have length equal to the number of classes, with values >= 0.
+   * The class with largest value p/t is predicted, where p is the original probability of that
+   * class and t is the class' threshold.
+   *
+   * @group setParam
+   */
+  def setThresholds(value: Array[Double]): this.type = {
+    set(thresholds, value)
+  }
+
+  /**
+   * Get thresholds for binary or multiclass classification.
+   *
+   * @group getParam
+   */
+  override def getThresholds: Array[Double] = {
+    $(thresholds)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Multinomial Logistic regression.
+ */
+@Since("2.1.0")
+@Experimental
+class MultinomialLogisticRegression @Since("2.1.0") (
+    @Since("2.1.0") override val uid: String)
+  extends ProbabilisticClassifier[Vector,
+    MultinomialLogisticRegression, MultinomialLogisticRegressionModel]
+    with MultinomialLogisticRegressionParams with DefaultParamsWritable with Logging {
+
+  @Since("2.1.0")
+  def this() = this(Identifiable.randomUID("mlogreg"))
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Set the ElasticNet mixing parameter.
+   * For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.
+   * For 0 < alpha < 1, the penalty is a combination of L1 and L2.
+   * Default is 0.0 which is an L2 penalty.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setElasticNetParam(value: Double): this.type = set(elasticNetParam, value)
+
+  setDefault(elasticNetParam -> 0.0)
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+
+  setDefault(maxIter -> 100)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * Smaller value will lead to higher accuracy with the cost of more iterations.
+   * Default is 1E
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580

Never mind. I always appreciate your reviews a lot!
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580

Sorry, I gave a wrong answer at the beginning. Next time, I will review more carefully before leaving a comment. Thank you for your work!
[GitHub] spark issue #14660: [SPARK-17071][SQL] Fetch Parquet schema without another ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14660

**[Test build #63828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63828/consoleFull)** for PR 14660 at commit [`e1214d5`](https://github.com/apache/spark/commit/e1214d50035441fb96551683cf38ae3e49f07b7d).
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580

:) I thought about this issue again. At this stage, could you make a PR for this? I think you're the best person to do that: you made this optimizer and found the correct fix. This was a nice chance for me to investigate this optimizer and nullability propagation. @gatorsmile, thank you for reviewing this. I'll close this PR soon.
[GitHub] spark pull request #14660: [SPARK-17071][SQL] Fetch Parquet schema without a...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/14660

[SPARK-17071][SQL] Fetch Parquet schema without another Spark job when it is a single file to touch

## What changes were proposed in this pull request?

It seems Spark always executes another job to figure out the schema ([ParquetFileFormat#L739-L778](https://github.com/apache/spark/blob/abff92bfdc7d4c9d2308794f0350561fe0ceb4dd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L739-L778)). However, that is a bit of overhead when there is only a single file to touch. I ran a benchmark with the code below:

```scala
test("Benchmark for JSON writer") {
  withTempPath { path =>
    Seq((1, 2D, 3L, "4")).toDF("a", "b", "c", "d")
      .write.format("parquet").save(path.getAbsolutePath)

    val benchmark = new Benchmark("Parquet - read schema", 1)
    benchmark.addCase("Parquet - read schema", 10) { _ =>
      spark.read.format("parquet").load(path.getCanonicalPath).schema
    }
    benchmark.run()
  }
}
```

with the results as below:

- **Before**

```
Parquet - read schema:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet - read schema                           47 /   49          0.0    46728419.0       1.0X
```

- **After**

```
Parquet - read schema:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet - read schema                            2 /    3          0.0     1811673.0       1.0X
```

It seems this became about 20X faster (although it is a small portion of total job run-time). As a reference, ORC does this on the driver side ([OrcFileOperator.scala#L74-L83](https://github.com/apache/spark/blob/a95252823e09939b654dd425db38dadc4100bc87/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L74-L83)).

## How was this patch tested?

Existing tests should cover this.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-17071

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14660.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14660

commit 614abbc6b7a03ff0d3e505697c0bbfec3b330c2b
Author: hyukjinkwon
Date: 2016-08-16T05:42:29Z

    Fetch Parquet schema within driver-side when there is single file to touch without another Spark job

commit e1214d50035441fb96551683cf38ae3e49f07b7d
Author: hyukjinkwon
Date: 2016-08-16T05:46:12Z

    Fix modifier
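As a sketch of the driver-side approach the PR describes — hedged: this uses parquet-mr's `ParquetFileReader.readFooter` to read a single file's footer without launching a Spark job, and the exact code path the PR ends up with may differ:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader

// Read only the footer of one Parquet file on the driver (illustrative path).
val conf = new Configuration()
val footer = ParquetFileReader.readFooter(
  conf, new Path("/path/to/part-00000.parquet"), ParquetMetadataConverter.NO_FILTER)
// Parquet MessageType; Spark converts this to a Catalyst StructType.
val parquetSchema = footer.getFileMetaData.getSchema
```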
[GitHub] spark issue #13796: [SPARK-7159][ML] Add multiclass logistic regression to S...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/13796

@sethah Thank you for the great work. I'll make another pass tomorrow.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580

One more try:

```scala
val splitConjunctiveConditions: Seq[Expression] = splitConjunctivePredicates(filter.condition)
val conditions = splitConjunctiveConditions ++ filter.constraints
val leftConditions = conditions.filter(_.references.subsetOf(join.left.outputSet))
val rightConditions = conditions.filter(_.references.subsetOf(join.right.outputSet))
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull)
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull)
```

Does this have a hole?
[GitHub] spark issue #14616: [SPARK-17034][SQL] adds expression UnresolvedOrdinal to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14616

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63821/
[GitHub] spark issue #14616: [SPARK-17034][SQL] adds expression UnresolvedOrdinal to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14616

Merged build finished. Test PASSed.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580

Another version. :)

```scala
val splitConjunctiveConditions: Seq[Expression] = splitConjunctivePredicates(filter.condition)
val conditions = splitConjunctiveConditions ++
  filter.constraints.filter(_.isInstanceOf[IsNotNull])
val leftConditions = conditions.filter(_.references.subsetOf(join.left.outputSet))
val rightConditions = conditions.filter(_.references.subsetOf(join.right.outputSet))
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull)
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull)
```
[GitHub] spark issue #14616: [SPARK-17034][SQL] adds expression UnresolvedOrdinal to ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14616

**[Test build #63821 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63821/consoleFull)** for PR 14616 at commit [`db84e25`](https://github.com/apache/spark/commit/db84e259749e6b339367fd42305f92a224407399).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14392

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63827/
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14392

Merged build finished. Test PASSed.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14392

**[Test build #63827 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63827/consoleFull)** for PR 14392 at commit [`05afe23`](https://github.com/apache/spark/commit/05afe2342648160165722f483cd69251826cb68e).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580

How about another version?

```scala
val leftConditions = (splitConjunctiveConditions ++
  filter.constraints.filter(_.isInstanceOf[IsNotNull]))
  .filter(_.references.subsetOf(join.left.outputSet))
val rightConditions = (splitConjunctiveConditions ++
  filter.constraints.filter(_.isInstanceOf[IsNotNull]))
  .filter(_.references.subsetOf(join.right.outputSet))
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull)
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull)
```
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580

Oh, that would be a perfect fix.
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14182

Merged build finished. Test PASSed.
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14182

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63826/
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14182

**[Test build #63826 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63826/consoleFull)** for PR 14182 at commit [`fa69bc6`](https://github.com/apache/spark/commit/fa69bc6a045322de52e55666bcc2a04cd8486b36).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580

How about this fix?

```scala
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull) ||
  filter.constraints.filter(_.isInstanceOf[IsNotNull])
    .exists(expr => expr.references.subsetOf(join.left.outputSet) && canFilterOutNull(expr))
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull) ||
  filter.constraints.filter(_.isInstanceOf[IsNotNull])
    .exists(expr => expr.references.subsetOf(join.right.outputSet) && canFilterOutNull(expr))
```
[GitHub] spark issue #13796: [SPARK-7159][ML] Add multiclass logistic regression to S...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/13796

@dbtsai Thanks for taking the time to review this! Major items right now:

* Adding the derivation to the aggregator doc (this is mostly finished; I'm just fighting Scaladoc with LaTeX).
* Deciding whether to add the initial model and tests in this PR or as a follow-up.
* Refactoring the logistic regression helper classes to a separate file.

Let me know if you see anything else.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14392

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63825/
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14392

Merged build finished. Test PASSed.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14392

**[Test build #63825 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63825/consoleFull)** for PR 14392 at commit [`cc708b5`](https://github.com/apache/spark/commit/cc708b549455ad1d850e86198a84060086d30386).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/13796#discussion_r74876946
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala ---
@@ -0,0 +1,626 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, LBFGS => BreezeLBFGS, OWLQN => BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+import org.apache.spark.sql.types.DoubleType
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * Params for multinomial logistic regression.
+ */
+private[classification] trait MultinomialLogisticRegressionParams
+  extends ProbabilisticClassifierParams with HasRegParam with HasElasticNetParam with HasMaxIter
+    with HasFitIntercept with HasTol with HasStandardization with HasWeightCol {
+
+  /**
+   * Set thresholds in multiclass (or binary) classification to adjust the probability of
+   * predicting each class. Array must have length equal to the number of classes, with values >= 0.
+   * The class with largest value p/t is predicted, where p is the original probability of that
+   * class and t is the class' threshold.
+   *
+   * @group setParam
+   */
+  def setThresholds(value: Array[Double]): this.type = {
+    set(thresholds, value)
+  }
+
+  /**
+   * Get thresholds for binary or multiclass classification.
+   *
+   * @group getParam
+   */
+  override def getThresholds: Array[Double] = {
+    $(thresholds)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Multinomial Logistic regression.
+ */
+@Since("2.1.0")
+@Experimental
+class MultinomialLogisticRegression @Since("2.1.0") (
+    @Since("2.1.0") override val uid: String)
+  extends ProbabilisticClassifier[Vector,
+    MultinomialLogisticRegression, MultinomialLogisticRegressionModel]
+    with MultinomialLogisticRegressionParams with DefaultParamsWritable with Logging {
+
+  @Since("2.1.0")
+  def this() = this(Identifiable.randomUID("mlogreg"))
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Set the ElasticNet mixing parameter.
+   * For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.
+   * For 0 < alpha < 1, the penalty is a combination of L1 and L2.
+   * Default is 0.0 which is an L2 penalty.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setElasticNetParam(value: Double): this.type = set(elasticNetParam, value)
+
+  setDefault(elasticNetParam -> 0.0)
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+
+  setDefault(maxIter -> 100)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * Smaller value will lead to higher accuracy with the cost of more iterations.
+   * Default is 1E
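For orientation, here is a minimal usage sketch of the estimator this diff adds; the `training` and `test` DataFrames are hypothetical (with the usual "label"/"features" columns), and the setters are the ones shown in the hunk above.
```scala
import org.apache.spark.ml.classification.MultinomialLogisticRegression

// Hypothetical DataFrames with "label" and "features" columns.
val mlr = new MultinomialLogisticRegression()
  .setRegParam(0.1)   // with elasticNetParam at its 0.0 default, this is an L2 penalty
  .setMaxIter(100)
val model = mlr.fit(training)
model.transform(test).select("prediction", "probability").show()
```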
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14182 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14182 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63824/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14182 **[Test build #63824 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63824/consoleFull)** for PR 14182 at commit [`8844961`](https://github.com/apache/spark/commit/884496153f9aa512bc437c1c23361479b6b2bc7b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14359 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63822/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14359 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14359 **[Test build #63822 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63822/consoleFull)** for PR 14359 at commit [`f79f77c`](https://github.com/apache/spark/commit/f79f77ce49aa797e8432b56fd2ad115540be67cf). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14392 **[Test build #63827 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63827/consoleFull)** for PR 14392 at commit [`05afe23`](https://github.com/apache/spark/commit/05afe2342648160165722f483cd69251826cb68e). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 Another, better fix is to use the `nullable` flag on `Expression` when collecting the `IsNotNull` constraints in `filter.constraints.filter(_.isInstanceOf[IsNotNull])`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
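One plausible reading of that suggestion, sketched below rather than taken from the actual patch: an `IsNotNull` constraint only proves an attribute is non-null when it sits directly on that attribute, so the constraint scan could be narrowed accordingly (`IsNotNull` and `Attribute` are the Catalyst expression classes; names here are illustrative).
```scala
// Sketch only: IsNotNull over a composite expression such as Coalesce(b, c)
// does not guarantee that any single referenced attribute is non-null, so
// trust only constraints of the form IsNotNull(<attribute>).
val provablyNotNull = filter.constraints.collect {
  case IsNotNull(a: Attribute) => a
}.toSet
val leftHasNonNullPredicate = join.left.outputSet.exists(provablyNotNull.contains)
```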
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 `canFilterOutNull` will cover almost all the cases. Sorry, I did not read the plan until you asked me to write a test case. Then I realized that the implementation of natural/using join just uses `coalesce`. As @hvanhovell and @nsyca said, that is just syntactic sugar. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14182 **[Test build #63826 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63826/consoleFull)** for PR 14182 at commit [`fa69bc6`](https://github.com/apache/spark/commit/fa69bc6a045322de52e55666bcc2a04cd8486b36). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14506: [SPARK-16916][SQL] serde/storage properties shoul...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14506 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14659: [SPARK-16757] Set up Spark caller context to HDFS
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14659 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 Please let me think more on this issue. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS
GitHub user Sherry302 opened a pull request: https://github.com/apache/spark/pull/14659 [SPARK-16757] Set up Spark caller context to HDFS
## What changes were proposed in this pull request?
1. Pass `jobId` to Task.
2. Invoke Hadoop APIs. A new function `setCallerContext` is added in `Utils`. `setCallerContext` invokes APIs of `org.apache.hadoop.ipc.CallerContext` to set up Spark caller contexts, which will be written into `hdfs-audit.log`. For applications in Yarn client mode, `org.apache.hadoop.ipc.CallerContext` is called in `Task` and the Yarn `Client`. For applications in Yarn cluster mode, `org.apache.hadoop.ipc.CallerContext` is called in `Task` and `ApplicationMaster`.
The Spark caller contexts written into `hdfs-audit.log` are the application's name (`{spark.app.name}`) and `JobID_stageID_stageAttemptId_taskID_attemptNumber`.
## How was this patch tested?
Manual tests against some Spark applications in Yarn client mode and Yarn cluster mode, checking whether Spark caller contexts are written into the HDFS `hdfs-audit.log` successfully. For example, run SparkKMeans in Yarn client mode:
`./bin/spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkKMeans examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar hdfs://localhost:9000/lr_big.txt 2 5`
Before: there is no Spark caller context in records of `hdfs-audit.log`. After: Spark caller contexts appear in records of `hdfs-audit.log`.
(_Note: the Spark caller context below was set because the Hadoop caller context API was invoked in the Yarn Client_)
`2016-07-21 13:52:30,802 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/lr_big.txt dst=null perm=null proto=rpc callerContext=SparkKMeans running on Spark`
(_Note: the Spark caller context below was set because the Hadoop caller context API was invoked in the Task_)
`2016-07-21 13:52:35,584 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=open src=/lr_big.txt dst=null perm=null proto=rpc callerContext=JobId_0_StageID_0_stageAttemptId_0_taskID_0_attemptNumber_0`
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Sherry302/spark callercontextSubmit
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14659.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14659
commit ec6833d32ef14950b2d81790bc908992f6288815
Author: Weiqing Yang
Date: 2016-08-16T04:11:41Z
[SPARK-16757] Set up Spark caller context to HDFS
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
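For readers following along, a minimal sketch of what such a `Utils.setCallerContext` helper can look like. Reflection keeps the code compiling against Hadoop versions that predate `org.apache.hadoop.ipc.CallerContext`; the exact shape of the PR's code may differ.
```scala
import scala.util.Try

// Sketch: set the HDFS caller context via reflection so Spark still builds
// against older Hadoop releases that lack org.apache.hadoop.ipc.CallerContext.
def setCallerContext(context: String): Unit = Try {
  val builderClass = Class.forName("org.apache.hadoop.ipc.CallerContext$Builder")
  val builder = builderClass.getConstructor(classOf[String]).newInstance(context)
  val callerContext = builderClass.getMethod("build").invoke(builder)
  val contextClass = Class.forName("org.apache.hadoop.ipc.CallerContext")
  contextClass.getMethod("setCurrent", contextClass)
    .invoke(null, callerContext.asInstanceOf[Object])
}
```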
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 Yep, I agree. `Expr` could be anything. However, this will greatly reduce the scope of this optimization. Is that okay with you? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14506: [SPARK-16916][SQL] serde/storage properties should not h...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14506 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14506: [SPARK-16916][SQL] serde/storage properties should not h...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14506 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63818/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14506: [SPARK-16916][SQL] serde/storage properties should not h...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/14506 Thanks. Merging to master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 If that is not applicable, I agree with @gatorsmile . --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 That just resolves a specific case. The expressions could be much more complex, and `Coalesce` can appear at a very deep layer of an expression. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14506: [SPARK-16916][SQL] serde/storage properties should not h...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14506 **[Test build #63818 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63818/consoleFull)** for PR 14506 at commit [`3042af2`](https://github.com/apache/spark/commit/3042af2f0e9ae82e40d14e950a1036b9e417dbc9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 What about this if we could exclude those functions?
```scala
 val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull) ||
   filter.constraints.filter(_.isInstanceOf[IsNotNull])
-    .exists(expr => join.left.outputSet.intersect(expr.references).nonEmpty)
+    .exists(expr => !expr.isInstanceOf[Coalesce] &&
+      leftOuterAttributeSet.intersect(expr.references).nonEmpty)
 val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull) ||
   filter.constraints.filter(_.isInstanceOf[IsNotNull])
-    .exists(expr => join.right.outputSet.intersect(expr.references).nonEmpty)
+    .exists(expr => !expr.isInstanceOf[Coalesce] &&
+      rightOuterAttributeSet.intersect(expr.references).nonEmpty)
```
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptron Class...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14447 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptron Class...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14447 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63820/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptron Class...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14447 **[Test build #63820 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63820/consoleFull)** for PR 14447 at commit [`7c94e2b`](https://github.com/apache/spark/commit/7c94e2ba11655cbd9275793f6c069ab3ba844238). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14558: [SPARK-16508][SparkR] Fix warnings on undocumente...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14558#discussion_r74874929
--- Diff: R/pkg/R/functions.R ---
@@ -1143,7 +1139,7 @@ setMethod("minute",
 #' @export
 #' @examples \dontrun{select(df, monotonically_increasing_id())}
 setMethod("monotonically_increasing_id",
-          signature(x = "missing"),
+          signature(),
--- End diff --
Automatic generation of S4 methods is not desirable, and I hope this case can be better handled by roxygen. For now, I agree that (b) is a good solution. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 The right fix is to change the following statements
```Scala
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull) ||
  filter.constraints.filter(_.isInstanceOf[IsNotNull])
    .exists(expr => join.left.outputSet.intersect(expr.references).nonEmpty)
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull) ||
  filter.constraints.filter(_.isInstanceOf[IsNotNull])
    .exists(expr => join.right.outputSet.intersect(expr.references).nonEmpty)
```
to the following ones:
```Scala
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull)
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull)
```
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 Sorry, my above description is not clear. `isnotnull(coalesce(b#227, c#238))` does not filter out `NULL` values of `b#227` and `c#238`. Only when both `b#227` and `c#238` are `NULL` does `coalesce(b#227, c#238)` return `NULL`. Thus, we are unable to use the following two statements to conclude whether the left or right side has non-null predicates:
```Scala
filter.constraints.filter(_.isInstanceOf[IsNotNull])
  .exists(expr => join.left.outputSet.intersect(expr.references).nonEmpty)
```
and
```Scala
filter.constraints.filter(_.isInstanceOf[IsNotNull])
  .exists(expr => join.right.outputSet.intersect(expr.references).nonEmpty)
```
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
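A quick spark-shell illustration of that point, with made-up values (this assumes the shell's implicits for `toDF`):
```scala
// coalesce(b, c) is NULL only when b and c are BOTH null, so
// IsNotNull(coalesce(b, c)) cannot prove that b alone (or c alone) is non-null.
val df = Seq[(Integer, Integer)]((null, 1), (2, null), (null, null)).toDF("b", "c")
df.selectExpr("coalesce(b, c) AS a").where("isnotnull(a)").show()
// keeps the (null, 1) and (2, null) rows; only (null, null) is filtered out
```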
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14392 **[Test build #63825 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63825/consoleFull)** for PR 14392 at commit [`cc708b5`](https://github.com/apache/spark/commit/cc708b549455ad1d850e86198a84060086d30386). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/14580 Can you explain `isnotnull(coalesce(b#227, c#238)) does not filter out NULL!!!`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/14359 Btw, to give back-of-the-envelope estimates, we can look at 2 numbers: (1) How many nodes will be split on each iteration? (2) How big is the forest which is serialized and sent to workers on each iteration?
For (1), here's an example:
* 1000 features, each with 50 bins -> 50 possible splits
* set maxMemoryInMB = 256 (default)
* regression => 3 Double values per possible split
* 256 * 10^6 / (1000 * 50 * 3 * 8) ≈ 213 nodes/iteration
This implies that for trees of depth > 8 or so, many iterations will only split nodes from 1 or 2 trees. I.e., we should avoid communicating most trees.
For (2), the forest can be pretty expensive to send.
* Each node:
  * leaf node: 5 Doubles
  * internal node: ~8 Doubles/references + Split
  * Split: O(# categories) or 2 values for continuous, say 3 Doubles on average
  * => say 8 Doubles/node on average
* 100 trees of depth 8 => 25600 nodes => 1.6MB
* 100 trees of depth 14 => 105MB
* I've heard of many cases of users wanting to fit 500-1000 trees and use trees of depth 18-20.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
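The estimates above as executable arithmetic, under the same assumed constants:
```scala
// (1) nodes split per iteration under the 256 MB aggregate budget
val bytesPerNode = 1000L * 50 * 3 * 8          // features * splits * Doubles * bytes ≈ 1.2 MB
val nodesPerIteration = 256L * 1000 * 1000 / bytesPerNode   // ≈ 213

// (2) serialized forest size at ~8 Doubles (64 bytes) per tree node
val forestDepth14Bytes = 100L * (1 << 14) * 64 // 100 trees of depth 14 ≈ 105 MB
```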
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14182 **[Test build #63824 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63824/consoleFull)** for PR 14182 at commit [`8844961`](https://github.com/apache/spark/commit/884496153f9aa512bc437c1c23361479b6b2bc7b). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14658: [WIP][SPARK-5928] Remote Shuffle Blocks cannot be more t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14658 **[Test build #63823 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63823/consoleFull)** for PR 14658 at commit [`443aa91`](https://github.com/apache/spark/commit/443aa91cfc2490be9733c78b7cd911f09bedfac6). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580
```Scala
val df12 = df1.join(df2, $"df1.a" === $"df2.a", "fullouter")
  .select(coalesce($"df1.b", $"df2.c").as("a"), $"df1.b", $"df2.c")
df12.join(df3, "a").explain(true)
```
This is an example to show that we should not eliminate the outer join, even if `isnotnull(coalesce(b#227, c#238))` contains attributes that are not in the join conditions. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14558: [SPARK-16508][SparkR] Fix warnings on undocumente...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14558#discussion_r74874081 --- Diff: R/pkg/R/SQLContext.R --- @@ -181,7 +181,7 @@ getDefaultSqlSource <- function() { #' @method createDataFrame default #' @note createDataFrame since 1.4.0 # TODO(davies): support sampling and infer type from NA -createDataFrame.default <- function(data, schema = NULL, samplingRatio = 1.0) { +createDataFrame.default <- function(data, schema = NULL) { --- End diff -- Oh yes... Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14658: [WIP][SPARK-5928] Remote Shuffle Blocks cannot be...
GitHub user witgo opened a pull request: https://github.com/apache/spark/pull/14658 [WIP][SPARK-5928] Remote Shuffle Blocks cannot be more than 2 GB
## What changes were proposed in this pull request?
Add class `ChunkFetchInputStream`, which has the following effects:
1. flow control [WIP]
2. reduced memory usage [WIP]
3. unlimited size [WIP]
## How was this patch tested?
WIP
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/witgo/spark SPARK-5928_Shuffle_Blocks_2G
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14658.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14658
commit 443aa91cfc2490be9733c78b7cd911f09bedfac6
Author: Guoqiang Li
Date: 2016-08-16T04:00:10Z
Remote Shuffle Blocks cannot be more than 2 GB
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/14392#discussion_r74873932
--- Diff: R/pkg/R/generics.R ---
@@ -1279,6 +1279,13 @@ setGeneric("spark.naiveBayes", function(data, formula, ...) { standardGeneric("s
 #' @export
 setGeneric("spark.survreg", function(data, formula, ...) { standardGeneric("spark.survreg") })
+#' @rdname spark.gaussianMixture
+#' @export
+setGeneric("spark.gaussianMixture",
+           function(data, formula, ...) {
+             standardGeneric("spark.gaussianMixture")
--- End diff --
It cannot fit on one line, since `lint-r` requires that lines be no longer than 100 characters. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14558: [SPARK-16508][SparkR] Fix warnings on undocumente...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14558#discussion_r74873867 --- Diff: R/pkg/R/mllib.R --- @@ -298,14 +304,15 @@ setMethod("summary", signature(object = "NaiveBayesModel"), #' Users can call \code{summary} to print a summary of the fitted model, \code{predict} to make #' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models. #' -#' @param data SparkDataFrame for training -#' @param formula A symbolic description of the model to be fitted. Currently only a few formula +#' @param data a SparkDataFrame for training. +#' @param formula a symbolic description of the model to be fitted. Currently only a few formula #'operators are supported, including '~', '.', ':', '+', and '-'. #'Note that the response variable of formula is empty in spark.kmeans. -#' @param k Number of centers -#' @param maxIter Maximum iteration number -#' @param initMode The initialization algorithm choosen to fit the model -#' @return \code{spark.kmeans} returns a fitted k-means model +#' @param ... additional argument(s) passed to the method. --- End diff -- Yeah agreed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14628: [SPARK-17050][ML][MLLib] Improve kmean rdd.aggregate to ...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14628 @holdenk I think depth = 2 is enough to handle large RDDs, and a bigger depth may add cost. I'll append test results later. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
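For reference, a toy sketch of the parameter being discussed, using the standard `RDD.treeAggregate` API (values here are illustrative):
```scala
// With depth = 2, treeAggregate inserts one intermediate reduce level, so the
// driver merges roughly sqrt(numPartitions) partial results instead of one
// result per partition.
val rdd = sc.parallelize(1 to 1000000, numSlices = 1000)
val sum = rdd.treeAggregate(0L)(
  seqOp = (acc, x) => acc + x,
  combOp = _ + _,
  depth = 2)
```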
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14359 **[Test build #63822 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63822/consoleFull)** for PR 14359 at commit [`f79f77c`](https://github.com/apache/spark/commit/f79f77ce49aa797e8432b56fd2ad115540be67cf). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 None of us is right. : ( `isnotnull(coalesce(b#227, c#238))` does not filter out `NULL`!!! Thus, the right fix is to remove the second condition:
```Scala
filter.constraints.filter(_.isInstanceOf[IsNotNull])
  .exists(expr => join.left.outputSet.intersect(expr.references).nonEmpty)
```
and
```Scala
filter.constraints.filter(_.isInstanceOf[IsNotNull])
  .exists(expr => join.right.outputSet.intersect(expr.references).nonEmpty)
```
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/14359 Sorry for the long delay; I've been swamped by other things for a while. Re-emerging... I switched to Stack and then realized Stack has been deprecated in Scala 2.11, so I reverted to the original NodeQueue. But I renamed NodeQueue to NodeStack to be a bit clearer. @hhbyyh Any luck testing this at scale? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 I found the root cause. None of us is right. : ( --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14647: [WIP][Test only][DEMO][SPARK-6235]Address various 2G lim...
Github user witgo commented on the issue: https://github.com/apache/spark/pull/14647 @hvanhovell I will submit some small PRs and provide a higher-level description of them. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13758 You are right. I missed that `UnsafeArrayData` is a subclass of `ArrayData`. We can pass `UnsafeArrayData` to a projection. I have one question. When we directly generate `UnsafeArrayData` from a primitive array and copy it into an `InternalRow` (`serializefromobject_result`), the following two operations are required:
1. Copy from a primitive array to `UnsafeArrayData`
2. Copy from `UnsafeArrayData` into `InternalRow` at line 102
On the other hand, this PR requires the following one operation:
0. (No copy happens at line 086 since this PR just stores a reference to a primitive array in `GenericArrayData`)
1. Copy from a primitive array to `InternalRow` ([this PR](https://github.com/apache/spark/pull/13911) performs `Platform.copyMemory` without iteration)
Can we avoid the additional copy at step 2 when we directly generate `UnsafeArrayData` from a primitive array? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
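For concreteness, the direct-generation path under discussion would look roughly like this, assuming the `fromPrimitiveArray` helper that PR #13911 introduces:
```scala
import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData

// Sketch: build an UnsafeArrayData straight from a primitive array in one
// bulk Platform.copyMemory, rather than holding a reference in
// GenericArrayData and copying element by element inside the projection.
val arr = Array(1, 2, 3)
val unsafeArray = UnsafeArrayData.fromPrimitiveArray(arr)
```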
[GitHub] spark issue #14616: [SPARK-17034][SQL] adds expression UnresolvedOrdinal to ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14616 **[Test build #63821 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63821/consoleFull)** for PR 14616 at commit [`db84e25`](https://github.com/apache/spark/commit/db84e259749e6b339367fd42305f92a224407399). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14649 Also, if my understanding is correct, we pick up only a single file to read the footer (see [ParquetFileFormat.scala#L217-L225](https://github.com/apache/spark/blob/abff92bfdc7d4c9d2308794f0350561fe0ceb4dd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L217-L225)) unless we merge schemas. It seems that for this reason writing `_metadata` or `_common_metadata` was disabled (see https://issues.apache.org/jira/browse/SPARK-15719). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
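Roughly, the single-footer read referenced above, using the same parquet-mr calls that appear in this PR's diff (`conf` and `fileStatus` are assumed to be in scope):
```scala
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader

// Read one file's footer; without schema merging, Spark infers the schema
// from a single footer like this rather than touching every data file.
val footer = ParquetFileReader.readFooter(conf, fileStatus, ParquetMetadataConverter.NO_FILTER)
val parquetSchema = footer.getFileMetaData.getSchema
```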
[GitHub] spark pull request #14649: [SPARK-17059][SQL] Allow FileFormat to specify pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14649#discussion_r74872775
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -423,6 +425,54 @@ class ParquetFileFormat
       sqlContext.sessionState.newHadoopConf(),
       options)
   }
+
+  override def filterPartitions(
+      filters: Seq[Filter],
+      schema: StructType,
+      conf: Configuration,
+      allFiles: Seq[FileStatus],
+      root: Path,
+      partitions: Seq[Partition]): Seq[Partition] = {
+    // Read the "_metadata" file if available, contains all block headers. On S3 better to grab
+    // all of the footers in a batch rather than having to read every single file just to get its
+    // footer.
+    allFiles.find(_.getPath.getName == ParquetFileWriter.PARQUET_METADATA_FILE).map { stat =>
+      val metadata = ParquetFileReader.readFooter(conf, stat, ParquetMetadataConverter.NO_FILTER)
+      partitions.map { part =>
+        filterByMetadata(
+          filters,
+          schema,
+          conf,
+          root,
+          metadata,
+          part)
+      }.filterNot(_.files.isEmpty)
+    }.getOrElse(partitions)
+  }
+
+  private def filterByMetadata(
+      filters: Seq[Filter],
+      schema: StructType,
+      conf: Configuration,
+      root: Path,
+      metadata: ParquetMetadata,
+      partition: Partition): Partition = {
+    val blockMetadatas = metadata.getBlocks.asScala
+    val parquetSchema = metadata.getFileMetaData.getSchema
+    val conjunctiveFilter = filters
+      .flatMap(ParquetFilters.createFilter(schema, _))
+      .reduceOption(FilterApi.and)
+    conjunctiveFilter.map { conjunction =>
+      val filteredBlocks = RowGroupFilter.filterRowGroups(
--- End diff --
Do you mind if I ask a question, please? If my understanding is correct, Parquet already filters row groups in both the normal reader and the vectorized reader (https://github.com/apache/spark/pull/13701). Is this doing the same thing on the Spark side? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14649: [SPARK-17059][SQL] Allow FileFormat to specify pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14649#discussion_r74872795
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -423,6 +425,54 @@ class ParquetFileFormat
       sqlContext.sessionState.newHadoopConf(),
       options)
   }
+
+  override def filterPartitions(
+      filters: Seq[Filter],
+      schema: StructType,
+      conf: Configuration,
+      allFiles: Seq[FileStatus],
+      root: Path,
+      partitions: Seq[Partition]): Seq[Partition] = {
+    // Read the "_metadata" file if available, contains all block headers. On S3 better to grab
+    // all of the footers in a batch rather than having to read every single file just to get its
+    // footer.
+    allFiles.find(_.getPath.getName == ParquetFileWriter.PARQUET_METADATA_FILE).map { stat =>
+      val metadata = ParquetFileReader.readFooter(conf, stat, ParquetMetadataConverter.NO_FILTER)
+      partitions.map { part =>
+        filterByMetadata(
+          filters,
+          schema,
+          conf,
+          root,
+          metadata,
+          part)
+      }.filterNot(_.files.isEmpty)
+    }.getOrElse(partitions)
+  }
+
+  private def filterByMetadata(
+      filters: Seq[Filter],
+      schema: StructType,
+      conf: Configuration,
+      root: Path,
+      metadata: ParquetMetadata,
+      partition: Partition): Partition = {
+    val blockMetadatas = metadata.getBlocks.asScala
+    val parquetSchema = metadata.getFileMetaData.getSchema
+    val conjunctiveFilter = filters
+      .flatMap(ParquetFilters.createFilter(schema, _))
+      .reduceOption(FilterApi.and)
+    conjunctiveFilter.map { conjunction =>
+      val filteredBlocks = RowGroupFilter.filterRowGroups(
--- End diff --
Also, doesn't this try to touch many files on the driver side? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptron Class...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14447 **[Test build #63820 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63820/consoleFull)** for PR 14447 at commit [`7c94e2b`](https://github.com/apache/spark/commit/7c94e2ba11655cbd9275793f6c069ab3ba844238). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org