[GitHub] spark pull request #14433: [SPARK-16829][SparkR]:sparkR sc.setLogLevel doesn...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14433#discussion_r74883390

--- Diff: core/src/main/scala/org/apache/spark/internal/Logging.scala ---
@@ -18,9 +18,9 @@ package org.apache.spark.internal

 import org.apache.log4j.{Level, LogManager, PropertyConfigurator}
+import org.apache.spark.deploy.{SparkShellType, SparkSubmit}
 import org.slf4j.{Logger, LoggerFactory}
 import org.slf4j.impl.StaticLoggerBinder
-
 import org.apache.spark.util.Utils
--- End diff --

As per [the Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports), `import org.apache.spark.deploy` should go here.
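The style guide groups imports as: java/javax, scala, third-party libraries, then `org.apache.spark`, with a blank line between groups. A minimal sketch of the ordering the reviewer is asking for, assuming the new `deploy` import belongs in the Spark group (surrounding imports taken from the diff):

```scala
import org.apache.log4j.{Level, LogManager, PropertyConfigurator}
import org.slf4j.{Logger, LoggerFactory}
import org.slf4j.impl.StaticLoggerBinder

// Spark's own packages form the last group, alphabetically ordered.
import org.apache.spark.deploy.{SparkShellType, SparkSubmit}
import org.apache.spark.util.Utils
```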
[GitHub] spark issue #14035: [SPARK-16356][ML] Add testImplicits for ML unit tests an...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14035

Hi @jkbradley, could you take a look at this one, please?
[GitHub] spark pull request #14182: [SPARK-16444][SparkR]: Isotonic Regression wrappe...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14182#discussion_r74883011

--- Diff: R/pkg/R/mllib.R ---
@@ -533,6 +630,25 @@ setMethod("write.ml", signature(object = "KMeansModel", path = "character"),
             invisible(callJMethod(writer, "save", path))
           })
+
+# Save fitted IsotonicRegressionModel to the input path
+
+#' @param path The directory where the model is saved
+#' @param overwrite Overwrites or not if the output path already exists. Default is FALSE
+#'                  which means throw exception if the output path exists.
+#'
+#' @rdname spark.isoreg
+#' @aliases write.ml,IsotonicRegressionModel-method
--- End diff --

`#' @aliases write.ml,IsotonicRegressionModel,character-method`
[GitHub] spark pull request #14656: [SPARK-17069] Expose spark.range() as table-value...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/14656#discussion_r74882826

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveTableValuedFunctions.scala ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.{SparkConf, SparkContext}
+import org.apache.spark.sql.catalyst.plans._
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Range}
+import org.apache.spark.sql.catalyst.rules._
+
+/**
+ * Rule that resolves table-valued function references.
+ */
+object ResolveTableValuedFunctions extends Rule[LogicalPlan] {
+  private lazy val defaultParallelism =
+    SparkContext.getOrCreate(new SparkConf(false)).defaultParallelism
+
+  /**
+   * List of argument names and their types, used to declare a function.
+   */
+  private case class ArgumentList(args: (String, Class[_])*) {
+    /**
+     * @return whether this list is assignable from the given sequence of values.
+     */
+    def assignableFrom(values: Seq[Any]): Boolean = {
+      if (args.length == values.length) {
+        args.zip(values).forall { case ((name, clazz), value) =>
+          clazz.isAssignableFrom(value.getClass)
+        }
+      } else {
+        false
+      }
+    }
+
+    override def toString: String = {
+      args.map { a =>
+        s"${a._1}: ${a._2.getSimpleName}"
+      }.mkString(", ")
+    }
+  }
+
+  /**
+   * A TVF maps argument lists to resolver functions that accept those arguments. Using a map
+   * here allows for function overloading.
+   */
+  private type TVF = Map[ArgumentList, Seq[Any] => LogicalPlan]
+
+  /**
+   * Internal registry of table-valued functions. TODO(ekl) we should have a proper registry
+   */
+  private val builtinFunctions: Map[String, TVF] = Map(
--- End diff --

So I think the catalyst way to do this is that the resolution rule only works if all the children of the expression are resolved, and then you have a CheckAnalysis rule that shows the error if there is an UnresolvedTableValuedFunction.
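A minimal sketch of the pattern being suggested — keep an unresolved placeholder in the plan and report the failure from a CheckAnalysis-style pass rather than inside the resolution rule. The node name and error message here are illustrative assumptions, not Spark's actual implementation:

```scala
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan}

// Hypothetical placeholder left behind when no overload of a TVF matches.
case class UnresolvedTableValuedFunction(name: String, args: Seq[Any]) extends LeafNode {
  override lazy val resolved: Boolean = false
  override def output: Seq[Attribute] = Nil
}

// In a CheckAnalysis-style pass, surface the error with a clear message.
def checkTableValuedFunctions(plan: LogicalPlan): Unit = plan.foreachUp {
  case u: UnresolvedTableValuedFunction =>
    throw new IllegalArgumentException(
      s"could not resolve table-valued function `${u.name}` with ${u.args.length} argument(s)")
  case _ =>
}
```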
[GitHub] spark pull request #14182: [SPARK-16444][SparkR]: Isotonic Regression wrappe...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14182#discussion_r74882707

--- Diff: R/pkg/R/mllib.R ---
@@ -299,6 +308,94 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
             return(list(apriori = apriori, tables = tables))
           })
+
+#' Isotonic Regression Model
+#'
+#' Fits an Isotonic Regression model against a Spark DataFrame, similarly to R's isoreg().
+#' Users can print, make predictions on the produced model and save the model to the input path.
+#'
+#' @param data SparkDataFrame for training
+#' @param formula A symbolic description of the model to be fitted. Currently only a few formula
+#'                operators are supported, including '~', '.', ':', '+', and '-'.
+#' @param isotonic Whether the output sequence should be isotonic/increasing (TRUE) or
+#'                 antitonic/decreasing (FALSE)
+#' @param featureIndex The index of the feature if \code{featuresCol} is a vector column (default: `0`),
+#'                     no effect otherwise
+#' @param weightCol The weight column name.
+#' @return \code{spark.isoreg} returns a fitted Isotonic Regression model
+#' @rdname spark.isoreg
+#' @aliases spark.isoreg,SparkDataFrame,formula-method
+#' @name spark.isoreg
+#' @export
+#' @examples
+#' \dontrun{
+#' sparkR.session()
+#' data <- list(list(7.0, 0.0), list(5.0, 1.0), list(3.0, 2.0),
+#'              list(5.0, 3.0), list(1.0, 4.0))
+#' df <- createDataFrame(data, c("label", "feature"))
+#' model <- spark.isoreg(df, label ~ feature, isotonic = FALSE)
+#' # return model boundaries and prediction as lists
+#' result <- summary(model, df)
+#' # prediction based on fitted model
+#' predict_data <- list(list(-2.0), list(-1.0), list(0.5),
+#'                      list(0.75), list(1.0), list(2.0), list(9.0))
+#' predict_df <- createDataFrame(predict_data, c("feature"))
+#' # get prediction column
+#' predict_result <- collect(select(predict(model, predict_df), "prediction"))
+#'
+#' # save fitted model to input path
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#'
+#' # can also read back the saved model and print
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.isoreg since 2.1.0
+setMethod("spark.isoreg", signature(data = "SparkDataFrame", formula = "formula"),
+          function(data, formula, isotonic = TRUE, featureIndex = 0, weightCol = NULL) {
+            formula <- paste0(deparse(formula), collapse = "")
+
+            if (is.null(weightCol)) {
+              weightCol <- ""
+            }
+
+            jobj <- callJStatic("org.apache.spark.ml.r.IsotonicRegressionWrapper", "fit",
+                                data@sdf, formula, as.logical(isotonic), as.integer(featureIndex),
+                                as.character(weightCol))
+            return(new("IsotonicRegressionModel", jobj = jobj))
+          })
+
+# Predicted values based on an isotonicRegression model
+
+#' @param object a fitted IsotonicRegressionModel
+#' @param newData SparkDataFrame for testing
+#' @return \code{predict} returns a SparkDataFrame containing predicted values
+#' @rdname spark.isoreg
+#' @aliases predict,SparkDataFrame,SparkDataFrame-method
--- End diff --

`#' @aliases predict,IsotonicRegressionModel,SparkDataFrame-method`
[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14392#discussion_r74882538

--- Diff: R/pkg/R/mllib.R ---
@@ -717,8 +717,9 @@ setMethod("spark.gaussianMixture", signature(data = "SparkDataFrame", formula =

 # Get the summary of a multivariate gaussian mixture model

-#' @param object A fitted gaussian mixture model
-#' @return \code{summary} returns the model's lambda, mu, sigma and posterior
+#' @param object a fitted gaussian mixture model.
+#' @param ... additional argument(s) passed to the method.
--- End diff --

I'd say "Currently not used" instead for this case.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/14392

Only one last comment, LGTM.
[GitHub] spark issue #14558: [SPARK-16508][SparkR] Fix warnings on undocumented/dupli...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14558

Merged build finished. Test PASSed.
[GitHub] spark issue #14558: [SPARK-16508][SparkR] Fix warnings on undocumented/dupli...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14558

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63830/
[GitHub] spark issue #14558: [SPARK-16508][SparkR] Fix warnings on undocumented/dupli...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14558

**[Test build #63830 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63830/consoleFull)** for PR 14558 at commit [`e5771a1`](https://github.com/apache/spark/commit/e5771a1d18366b2b4ace2ffba49bbe89312e0acd).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13680#discussion_r74881945

--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java ---
@@ -25,55 +25,57 @@
 import org.apache.spark.sql.types.*;
 import org.apache.spark.unsafe.Platform;
 import org.apache.spark.unsafe.array.ByteArrayMethods;
+import org.apache.spark.unsafe.bitset.BitSetMethods;
 import org.apache.spark.unsafe.hash.Murmur3_x86_32;
 import org.apache.spark.unsafe.types.CalendarInterval;
 import org.apache.spark.unsafe.types.UTF8String;

 /**
  * An Unsafe implementation of Array which is backed by raw memory instead of Java objects.
  *
- * Each tuple has three parts: [numElements] [offsets] [values]
+ * Each array has four parts: [numElements][null bits][values or offset][variable length portion]
  *
- * The `numElements` is 4 bytes storing the number of elements of this array.
+ * The `numElements` is 8 bytes storing the number of elements of this array.
  *
- * In the `offsets` region, we store 4 bytes per element, represents the relative offset (w.r.t. the
- * base address of the array) of this element in `values` region. We can get the length of this
- * element by subtracting next offset.
- * Note that offset can by negative which means this element is null.
+ * In the `null bits` region, we store 1 bit per element, represents whether a element has null
+ * Its total size is ceil(numElements / 8) bytes, and it is aligned to 8-byte word boundaries.
  *
- * In the `values` region, we store the content of elements. As we can get length info, so elements
- * can be variable-length.
+ * In the `values or offset` region, we store the content of elements. For fields that hold
+ * fixed-length primitive types, such as long, double, or int, we store the value directly
+ * in the field. For fields with non-primitive or variable-length values, we store a relative
+ * offset (w.r.t. the base address of the array) that points to the beginning of
+ * the variable-length field into int. It can only be calculated by knowing the total bytes of
--- End diff --

i see, makes sense, thanks for the explanation!
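For reference on the new layout, the byte offsets of each region can be derived from the comment above — a minimal sketch assuming 8-byte words, as the comment describes (an illustrative helper, not Spark's actual code):

```scala
// Layout: [numElements: 8 bytes][null bits][values or offsets][variable length portion]
// Returns (start of null bits, start of values, start of variable-length region),
// in bytes relative to the array's base address.
def regionOffsets(numElements: Int, fixedElementSize: Int): (Long, Long, Long) = {
  val nullBitsStart = 8L // numElements occupies the first 8 bytes
  // 1 bit per element, rounded up to a multiple of 8-byte words
  val nullBitsBytes = ((numElements + 63) / 64) * 8L
  val valuesStart = nullBitsStart + nullBitsBytes
  val variableStart = valuesStart + numElements.toLong * fixedElementSize
  (nullBitsStart, valuesStart, variableStart)
}
```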
[GitHub] spark issue #14658: [WIP][SPARK-5928][SPARK-6238] Remote Shuffle Blocks cann...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14658

Merged build finished. Test PASSed.
[GitHub] spark issue #14658: [WIP][SPARK-5928][SPARK-6238] Remote Shuffle Blocks cann...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14658

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63823/
[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13758

No, by generating the unsafe array directly, we will have an additional copy. But I don't think that matters too much, or that it's worth such a big change. Anyway, let's get the new unsafe array in first, and we can come back to this PR later.
[GitHub] spark issue #14658: [WIP][SPARK-5928][SPARK-6238] Remote Shuffle Blocks cann...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14658

**[Test build #63823 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63823/consoleFull)** for PR 14658 at commit [`443aa91`](https://github.com/apache/spark/commit/443aa91cfc2490be9733c78b7cd911f09bedfac6).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `public class ChunkFetchInputStream extends InputStream`
[GitHub] spark pull request #14648: [SPARK-16995][SQL] TreeNodeException when flat ma...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14648#discussion_r74881208

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -727,6 +727,13 @@ object FoldablePropagation extends Rule[LogicalPlan] {
         case j @ Join(_, _, LeftOuter | RightOuter | FullOuter, _) =>
           stop = true
           j
+
+        // Operators that operate on objects should only have expressions from encoders, which
+        // should never have foldable expressions.
+        case o: ObjectConsumer => o
--- End diff --

OK, I will update once I'm back at my laptop.
[GitHub] spark pull request #14648: [SPARK-16995][SQL] TreeNodeException when flat ma...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14648#discussion_r74880666

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -727,6 +727,13 @@ object FoldablePropagation extends Rule[LogicalPlan] {
         case j @ Join(_, _, LeftOuter | RightOuter | FullOuter, _) =>
           stop = true
           j
+
+        // Operators that operate on objects should only have expressions from encoders, which
+        // should never have foldable expressions.
+        case o: ObjectConsumer => o
--- End diff --

We should follow the other cases and set `stop` to `true`.
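A minimal sketch of the suggested change, following the shape of the `Join` case quoted in the diff (the surrounding rule code is paraphrased from that context):

```scala
// Operators that operate on objects should only have expressions from encoders,
// which should never have foldable expressions — so stop propagation here,
// just like the other terminating cases do.
case o: ObjectConsumer =>
  stop = true
  o
```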
[GitHub] spark issue #14558: [SPARK-16508][SparkR] Fix warnings on undocumented/dupli...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14558

**[Test build #63830 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63830/consoleFull)** for PR 14558 at commit [`e5771a1`](https://github.com/apache/spark/commit/e5771a1d18366b2b4ace2ffba49bbe89312e0acd).
[GitHub] spark issue #14656: [SPARK-17069] Expose spark.range() as table-valued funct...
Github user ericl commented on the issue: https://github.com/apache/spark/pull/14656

Updated to use catalyst type coercion.
[GitHub] spark issue #14656: [SPARK-17069] Expose spark.range() as table-valued funct...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14656

**[Test build #63829 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63829/consoleFull)** for PR 14656 at commit [`2f80f54`](https://github.com/apache/spark/commit/2f80f549dd3d765fd3fdc63f9795d8e5562e38fa).
[GitHub] spark pull request #14601: [SPARK-13979][Core][WIP]Killed executor is re spa...
Github user agsachin commented on a diff in the pull request: https://github.com/apache/spark/pull/14601#discussion_r74880042

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
@@ -107,6 +107,14 @@ class SparkHadoopUtil extends Logging {
       if (key.startsWith("spark.hadoop.")) {
--- End diff --

done.
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user jodersky commented on the issue: https://github.com/apache/spark/pull/14359

> I switched to Stack and then realized Stack has been deprecated in Scala 2.11...

I think you probably read the *immutable* stack docs; the *mutable* stack is not deprecated AFAIK. I can imagine that having a custom stack implementation may allow for additional operations in the future; however, we should also consider that using standard collections reduces the load for anyone who will maintain the code later. Btw, I highly recommend using the [milestone scaladocs](http://www.scala-lang.org/api/2.12.0-M5/scala/collection/mutable/Stack.html) over the current ones. Although 2.12 is not officially out yet, the changes to the library are minimal and the UI is much more pleasant to use.
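For reference, `scala.collection.mutable.Stack` is the non-deprecated one in 2.11 (only `scala.collection.immutable.Stack` carries the deprecation) — a small usage example:

```scala
import scala.collection.mutable

val stack = mutable.Stack[Int]()
stack.push(1)
stack.push(2)
println(stack.pop()) // 2 — last in, first out
println(stack.top)   // 1 — peek without removing
```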
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580

Thank you for the reviews so far, @gatorsmile, @hvanhovell, @nsyca, @yhuai. I'm closing this PR. I'm looking forward to seeing @gatorsmile's PR and to getting a better master branch soon. :)
[GitHub] spark pull request #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optim...
Github user dongjoon-hyun closed the pull request at: https://github.com/apache/spark/pull/14580
[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/13796#discussion_r74879829

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala ---
@@ -0,0 +1,626 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, LBFGS => BreezeLBFGS, OWLQN => BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+import org.apache.spark.sql.types.DoubleType
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * Params for multinomial logistic regression.
+ */
+private[classification] trait MultinomialLogisticRegressionParams
+  extends ProbabilisticClassifierParams with HasRegParam with HasElasticNetParam with HasMaxIter
+    with HasFitIntercept with HasTol with HasStandardization with HasWeightCol {
+
+  /**
+   * Set thresholds in multiclass (or binary) classification to adjust the probability of
+   * predicting each class. Array must have length equal to the number of classes, with values >= 0.
+   * The class with largest value p/t is predicted, where p is the original probability of that
+   * class and t is the class' threshold.
+   *
+   * @group setParam
+   */
+  def setThresholds(value: Array[Double]): this.type = {
+    set(thresholds, value)
+  }
+
+  /**
+   * Get thresholds for binary or multiclass classification.
+   *
+   * @group getParam
+   */
+  override def getThresholds: Array[Double] = {
+    $(thresholds)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Multinomial Logistic regression.
+ */
+@Since("2.1.0")
+@Experimental
+class MultinomialLogisticRegression @Since("2.1.0") (
+    @Since("2.1.0") override val uid: String)
+  extends ProbabilisticClassifier[Vector,
+    MultinomialLogisticRegression, MultinomialLogisticRegressionModel]
+    with MultinomialLogisticRegressionParams with DefaultParamsWritable with Logging {
+
+  @Since("2.1.0")
+  def this() = this(Identifiable.randomUID("mlogreg"))
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Set the ElasticNet mixing parameter.
+   * For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.
+   * For 0 < alpha < 1, the penalty is a combination of L1 and L2.
+   * Default is 0.0 which is an L2 penalty.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setElasticNetParam(value: Double): this.type = set(elasticNetParam, value)
+
+  setDefault(elasticNetParam -> 0.0)
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+
+  setDefault(maxIter -> 100)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * Smaller value will lead to higher accuracy with the cost of more iterations.
+   * Default is 1E
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580

Never mind. I always appreciate your reviews a lot!
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580

Sorry, I gave a wrong answer at the beginning. Next time, I will review more carefully before leaving a comment. Thank you for your work!
[GitHub] spark issue #14660: [SPARK-17071][SQL] Fetch Parquet schema without another ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14660

**[Test build #63828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63828/consoleFull)** for PR 14660 at commit [`e1214d5`](https://github.com/apache/spark/commit/e1214d50035441fb96551683cf38ae3e49f07b7d).
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580

:) I thought about this issue again. At this stage, could you make a PR for this? I think you're the best person to do that: you made this optimizer and found the correct fix. This was a nice chance for me to investigate this optimizer and nullability propagation. @gatorsmile, thank you for reviewing this. I'll close this PR soon.
[GitHub] spark pull request #14660: [SPARK-17071][SQL] Fetch Parquet schema without a...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/14660

[SPARK-17071][SQL] Fetch Parquet schema without another Spark job when it is a single file to touch

## What changes were proposed in this pull request?

It seems Spark always executes another job to figure out the schema ([ParquetFileFormat#L739-L778](https://github.com/apache/spark/blob/abff92bfdc7d4c9d2308794f0350561fe0ceb4dd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L739-L778)). However, that is a bit of overhead when there is only a single file to touch. I ran a benchmark with the code below:

```scala
test("Benchmark for JSON writer") {
  withTempPath { path =>
    Seq((1, 2D, 3L, "4")).toDF("a", "b", "c", "d")
      .write.format("parquet").save(path.getAbsolutePath)

    val benchmark = new Benchmark("Parquet - read schema", 1)
    benchmark.addCase("Parquet - read schema", 10) { _ =>
      spark.read.format("parquet").load(path.getCanonicalPath).schema
    }
    benchmark.run()
  }
}
```

with the results as below:

- **Before**

```
Parquet - read schema:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet - read schema                           47 /   49          0.0    46728419.0       1.0X
```

- **After**

```
Parquet - read schema:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet - read schema                            2 /    3          0.0     1811673.0       1.0X
```

It seems this became about 20X faster (although it is a small portion of total job run-time). As a reference, ORC does this on the driver side ([OrcFileOperator.scala#L74-L83](https://github.com/apache/spark/blob/a95252823e09939b654dd425db38dadc4100bc87/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L74-L83)).

## How was this patch tested?

Existing tests should cover this.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-17071

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14660.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14660

commit 614abbc6b7a03ff0d3e505697c0bbfec3b330c2b
Author: hyukjinkwon
Date: 2016-08-16T05:42:29Z

    Fetch Parquet schema within driver-side when there is single file to touch without another Spark job

commit e1214d50035441fb96551683cf38ae3e49f07b7d
Author: hyukjinkwon
Date: 2016-08-16T05:46:12Z

    Fix modifier
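As a sketch of the driver-side approach the PR describes — hedged: this uses parquet-mr's `ParquetFileReader.readFooter` to read a single file's footer without launching a Spark job, and the exact code path the PR ends up with may differ:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader

// Read only the footer of one Parquet file on the driver (illustrative path).
val conf = new Configuration()
val footer = ParquetFileReader.readFooter(
  conf, new Path("/path/to/part-00000.parquet"), ParquetMetadataConverter.NO_FILTER)
// Parquet MessageType; Spark converts this to a Catalyst StructType.
val parquetSchema = footer.getFileMetaData.getSchema
```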
[GitHub] spark issue #13796: [SPARK-7159][ML] Add multiclass logistic regression to S...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/13796

@sethah Thank you for the great work. I'll make another pass tomorrow.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580

One more try:

```scala
val splitConjunctiveConditions: Seq[Expression] = splitConjunctivePredicates(filter.condition)
val conditions = splitConjunctiveConditions ++ filter.constraints
val leftConditions = conditions.filter(_.references.subsetOf(join.left.outputSet))
val rightConditions = conditions.filter(_.references.subsetOf(join.right.outputSet))
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull)
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull)
```

Does this have a hole?
[GitHub] spark issue #14616: [SPARK-17034][SQL] adds expression UnresolvedOrdinal to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14616

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63821/
[GitHub] spark issue #14616: [SPARK-17034][SQL] adds expression UnresolvedOrdinal to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14616

Merged build finished. Test PASSed.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580

Another version. :)

```scala
val splitConjunctiveConditions: Seq[Expression] = splitConjunctivePredicates(filter.condition)
val conditions = splitConjunctiveConditions ++
  filter.constraints.filter(_.isInstanceOf[IsNotNull])
val leftConditions = conditions.filter(_.references.subsetOf(join.left.outputSet))
val rightConditions = conditions.filter(_.references.subsetOf(join.right.outputSet))
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull)
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull)
```
[GitHub] spark issue #14616: [SPARK-17034][SQL] adds expression UnresolvedOrdinal to ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14616

**[Test build #63821 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63821/consoleFull)** for PR 14616 at commit [`db84e25`](https://github.com/apache/spark/commit/db84e259749e6b339367fd42305f92a224407399).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14392

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63827/
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14392

Merged build finished. Test PASSed.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14392

**[Test build #63827 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63827/consoleFull)** for PR 14392 at commit [`05afe23`](https://github.com/apache/spark/commit/05afe2342648160165722f483cd69251826cb68e).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580

How about another version?

```scala
val leftConditions = (splitConjunctiveConditions ++
  filter.constraints.filter(_.isInstanceOf[IsNotNull]))
  .filter(_.references.subsetOf(join.left.outputSet))
val rightConditions = (splitConjunctiveConditions ++
  filter.constraints.filter(_.isInstanceOf[IsNotNull]))
  .filter(_.references.subsetOf(join.right.outputSet))
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull)
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull)
```
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580

Oh, that would be a perfect fix.
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14182

Merged build finished. Test PASSed.
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14182

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63826/
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14182

**[Test build #63826 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63826/consoleFull)** for PR 14182 at commit [`fa69bc6`](https://github.com/apache/spark/commit/fa69bc6a045322de52e55666bcc2a04cd8486b36).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580

How about this fix?

```scala
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull) ||
  filter.constraints.filter(_.isInstanceOf[IsNotNull])
    .exists(expr => expr.references.subsetOf(join.left.outputSet) && canFilterOutNull(expr))
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull) ||
  filter.constraints.filter(_.isInstanceOf[IsNotNull])
    .exists(expr => expr.references.subsetOf(join.right.outputSet) && canFilterOutNull(expr))
```
[GitHub] spark issue #13796: [SPARK-7159][ML] Add multiclass logistic regression to S...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/13796

@dbtsai Thanks for taking the time to review this! Major items right now:

* Adding the derivation to the aggregator doc (this is mostly finished; I'm just fighting Scaladoc with LaTeX).
* Deciding whether to add the initial model and tests in this PR or as a follow-up.
* Refactoring the logistic regression helper classes to a separate file.

Let me know if you see anything else.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14392

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63825/
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14392

Merged build finished. Test PASSed.
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14392

**[Test build #63825 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63825/consoleFull)** for PR 14392 at commit [`cc708b5`](https://github.com/apache/spark/commit/cc708b549455ad1d850e86198a84060086d30386).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #13796: [SPARK-7159][ML] Add multiclass logistic regressi...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/13796#discussion_r74876946
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/MultinomialLogisticRegression.scala ---
@@ -0,0 +1,626 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, LBFGS => BreezeLBFGS, OWLQN => BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+import org.apache.spark.sql.types.DoubleType
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * Params for multinomial logistic regression.
+ */
+private[classification] trait MultinomialLogisticRegressionParams
+  extends ProbabilisticClassifierParams with HasRegParam with HasElasticNetParam with HasMaxIter
+    with HasFitIntercept with HasTol with HasStandardization with HasWeightCol {
+
+  /**
+   * Set thresholds in multiclass (or binary) classification to adjust the probability of
+   * predicting each class. Array must have length equal to the number of classes, with values >= 0.
+   * The class with largest value p/t is predicted, where p is the original probability of that
+   * class and t is the class' threshold.
+   *
+   * @group setParam
+   */
+  def setThresholds(value: Array[Double]): this.type = {
+    set(thresholds, value)
+  }
+
+  /**
+   * Get thresholds for binary or multiclass classification.
+   *
+   * @group getParam
+   */
+  override def getThresholds: Array[Double] = {
+    $(thresholds)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Multinomial Logistic regression.
+ */
+@Since("2.1.0")
+@Experimental
+class MultinomialLogisticRegression @Since("2.1.0") (
+    @Since("2.1.0") override val uid: String)
+  extends ProbabilisticClassifier[Vector,
+    MultinomialLogisticRegression, MultinomialLogisticRegressionModel]
+    with MultinomialLogisticRegressionParams with DefaultParamsWritable with Logging {
+
+  @Since("2.1.0")
+  def this() = this(Identifiable.randomUID("mlogreg"))
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Set the ElasticNet mixing parameter.
+   * For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.
+   * For 0 < alpha < 1, the penalty is a combination of L1 and L2.
+   * Default is 0.0 which is an L2 penalty.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setElasticNetParam(value: Double): this.type = set(elasticNetParam, value)
+
+  setDefault(elasticNetParam -> 0.0)
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.1.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+
+  setDefault(maxIter -> 100)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * Smaller value will lead to higher accuracy with the cost of more iterations.
+   * Default is 1E
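For orientation, here is a minimal usage sketch of the estimator this diff adds; the `training` and `test` DataFrames are hypothetical (with the usual "label"/"features" columns), and the setters are the ones shown in the hunk above.
```scala
import org.apache.spark.ml.classification.MultinomialLogisticRegression

// Hypothetical DataFrames with "label" and "features" columns.
val mlr = new MultinomialLogisticRegression()
  .setRegParam(0.1)   // with elasticNetParam at its 0.0 default, this is an L2 penalty
  .setMaxIter(100)
val model = mlr.fit(training)
model.transform(test).select("prediction", "probability").show()
```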
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14182 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14182 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63824/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14182 **[Test build #63824 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63824/consoleFull)** for PR 14182 at commit [`8844961`](https://github.com/apache/spark/commit/884496153f9aa512bc437c1c23361479b6b2bc7b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14359 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63822/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14359 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14359 **[Test build #63822 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63822/consoleFull)** for PR 14359 at commit [`f79f77c`](https://github.com/apache/spark/commit/f79f77ce49aa797e8432b56fd2ad115540be67cf). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14392 **[Test build #63827 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63827/consoleFull)** for PR 14392 at commit [`05afe23`](https://github.com/apache/spark/commit/05afe2342648160165722f483cd69251826cb68e). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 Another, better fix is to use the `nullable` flag on `Expression` when collecting the `IsNotNull` constraints in `filter.constraints.filter(_.isInstanceOf[IsNotNull])`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
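One plausible reading of that suggestion, sketched below rather than taken from the actual patch: an `IsNotNull` constraint only proves an attribute is non-null when it sits directly on that attribute, so the constraint scan could be narrowed accordingly (`IsNotNull` and `Attribute` are the Catalyst expression classes; names here are illustrative).
```scala
// Sketch only: IsNotNull over a composite expression such as Coalesce(b, c)
// does not guarantee that any single referenced attribute is non-null, so
// trust only constraints of the form IsNotNull(<attribute>).
val provablyNotNull = filter.constraints.collect {
  case IsNotNull(a: Attribute) => a
}.toSet
val leftHasNonNullPredicate = join.left.outputSet.exists(provablyNotNull.contains)
```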
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 `canFilterOutNull` will cover almost all the cases. Sorry, I did not read the plan until you asked me to write a test case. Then I realized that the implementation of natural/using join just uses `coalesce`. As @hvanhovell and @nsyca said, that is just syntactic sugar. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14182 **[Test build #63826 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63826/consoleFull)** for PR 14182 at commit [`fa69bc6`](https://github.com/apache/spark/commit/fa69bc6a045322de52e55666bcc2a04cd8486b36). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14506: [SPARK-16916][SQL] serde/storage properties shoul...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14506 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14659: [SPARK-16757] Set up Spark caller context to HDFS
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14659 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 Please let me think more on this issue. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS
GitHub user Sherry302 opened a pull request: https://github.com/apache/spark/pull/14659 [SPARK-16757] Set up Spark caller context to HDFS
## What changes were proposed in this pull request?
1. Pass `jobId` to Task.
2. Invoke Hadoop APIs. A new function `setCallerContext` is added in `Utils`. `setCallerContext` invokes APIs of `org.apache.hadoop.ipc.CallerContext` to set up Spark caller contexts, which will be written into `hdfs-audit.log`. For applications in Yarn client mode, `org.apache.hadoop.ipc.CallerContext` is called in `Task` and the Yarn `Client`. For applications in Yarn cluster mode, `org.apache.hadoop.ipc.CallerContext` is called in `Task` and `ApplicationMaster`.
The Spark caller contexts written into `hdfs-audit.log` are the application's name (`{spark.app.name}`) and `JobID_stageID_stageAttemptId_taskID_attemptNumber`.
## How was this patch tested?
Manual tests against some Spark applications in Yarn client mode and Yarn cluster mode, checking whether Spark caller contexts are written into the HDFS `hdfs-audit.log` successfully. For example, run SparkKMeans in Yarn client mode:
`./bin/spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkKMeans examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar hdfs://localhost:9000/lr_big.txt 2 5`
Before: there is no Spark caller context in records of `hdfs-audit.log`. After: Spark caller contexts appear in records of `hdfs-audit.log`.
(_Note: the Spark caller context below was set because the Hadoop caller context API was invoked in the Yarn Client_)
`2016-07-21 13:52:30,802 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/lr_big.txt dst=null perm=null proto=rpc callerContext=SparkKMeans running on Spark`
(_Note: the Spark caller context below was set because the Hadoop caller context API was invoked in the Task_)
`2016-07-21 13:52:35,584 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=open src=/lr_big.txt dst=null perm=null proto=rpc callerContext=JobId_0_StageID_0_stageAttemptId_0_taskID_0_attemptNumber_0`
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Sherry302/spark callercontextSubmit
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14659.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14659
commit ec6833d32ef14950b2d81790bc908992f6288815
Author: Weiqing Yang
Date: 2016-08-16T04:11:41Z
[SPARK-16757] Set up Spark caller context to HDFS
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
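For readers following along, a minimal sketch of what such a `Utils.setCallerContext` helper can look like. Reflection keeps the code compiling against Hadoop versions that predate `org.apache.hadoop.ipc.CallerContext`; the exact shape of the PR's code may differ.
```scala
import scala.util.Try

// Sketch: set the HDFS caller context via reflection so Spark still builds
// against older Hadoop releases that lack org.apache.hadoop.ipc.CallerContext.
def setCallerContext(context: String): Unit = Try {
  val builderClass = Class.forName("org.apache.hadoop.ipc.CallerContext$Builder")
  val builder = builderClass.getConstructor(classOf[String]).newInstance(context)
  val callerContext = builderClass.getMethod("build").invoke(builder)
  val contextClass = Class.forName("org.apache.hadoop.ipc.CallerContext")
  contextClass.getMethod("setCurrent", contextClass)
    .invoke(null, callerContext.asInstanceOf[Object])
}
```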
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 Yep, I agree. `Expr` could be anything. However, this will greatly reduce the scope of this optimization. Is that okay with you? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14506: [SPARK-16916][SQL] serde/storage properties should not h...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14506 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14506: [SPARK-16916][SQL] serde/storage properties should not h...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14506 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63818/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14506: [SPARK-16916][SQL] serde/storage properties should not h...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/14506 Thanks. Merging to master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 If that is not applicable, I agree with @gatorsmile . --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 That just resolves a specific case. The expressions could be much more complex, and `Coalesce` can appear at a very deep layer of an expression. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14506: [SPARK-16916][SQL] serde/storage properties should not h...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14506 **[Test build #63818 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63818/consoleFull)** for PR 14506 at commit [`3042af2`](https://github.com/apache/spark/commit/3042af2f0e9ae82e40d14e950a1036b9e417dbc9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14580 What about this if we could exclude those functions?
```scala
 val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull) ||
   filter.constraints.filter(_.isInstanceOf[IsNotNull])
-    .exists(expr => join.left.outputSet.intersect(expr.references).nonEmpty)
+    .exists(expr => !expr.isInstanceOf[Coalesce] &&
+      leftOuterAttributeSet.intersect(expr.references).nonEmpty)
 val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull) ||
   filter.constraints.filter(_.isInstanceOf[IsNotNull])
-    .exists(expr => join.right.outputSet.intersect(expr.references).nonEmpty)
+    .exists(expr => !expr.isInstanceOf[Coalesce] &&
+      rightOuterAttributeSet.intersect(expr.references).nonEmpty)
```
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptron Class...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14447 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptron Class...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14447 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63820/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptron Class...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14447 **[Test build #63820 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63820/consoleFull)** for PR 14447 at commit [`7c94e2b`](https://github.com/apache/spark/commit/7c94e2ba11655cbd9275793f6c069ab3ba844238). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14558: [SPARK-16508][SparkR] Fix warnings on undocumente...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14558#discussion_r74874929
--- Diff: R/pkg/R/functions.R ---
@@ -1143,7 +1139,7 @@ setMethod("minute",
 #' @export
 #' @examples \dontrun{select(df, monotonically_increasing_id())}
 setMethod("monotonically_increasing_id",
-          signature(x = "missing"),
+          signature(),
--- End diff --
Automatic generation of S4 methods is not desirable, and I hope this case can be better handled by roxygen. For now, I agree that (b) is a good solution. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 The right fix is to change the following statements
```Scala
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull) ||
  filter.constraints.filter(_.isInstanceOf[IsNotNull])
    .exists(expr => join.left.outputSet.intersect(expr.references).nonEmpty)
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull) ||
  filter.constraints.filter(_.isInstanceOf[IsNotNull])
    .exists(expr => join.right.outputSet.intersect(expr.references).nonEmpty)
```
to the following ones:
```Scala
val leftHasNonNullPredicate = leftConditions.exists(canFilterOutNull)
val rightHasNonNullPredicate = rightConditions.exists(canFilterOutNull)
```
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 Sorry, my above description is not clear. `isnotnull(coalesce(b#227, c#238))` does not filter out `NULL` values of `b#227` and `c#238`. Only when both `b#227` and `c#238` are `NULL` does `coalesce(b#227, c#238)` return `NULL`. Thus, we are unable to use the following two statements to conclude whether the left or right side has non-null predicates:
```Scala
filter.constraints.filter(_.isInstanceOf[IsNotNull])
  .exists(expr => join.left.outputSet.intersect(expr.references).nonEmpty)
```
and
```Scala
filter.constraints.filter(_.isInstanceOf[IsNotNull])
  .exists(expr => join.right.outputSet.intersect(expr.references).nonEmpty)
```
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
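A quick spark-shell illustration of that point, with made-up values (this assumes the shell's implicits for `toDF`):
```scala
// coalesce(b, c) is NULL only when b and c are BOTH null, so
// IsNotNull(coalesce(b, c)) cannot prove that b alone (or c alone) is non-null.
val df = Seq[(Integer, Integer)]((null, 1), (2, null), (null, null)).toDF("b", "c")
df.selectExpr("coalesce(b, c) AS a").where("isnotnull(a)").show()
// keeps the (null, 1) and (2, null) rows; only (null, null) is filtered out
```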
[GitHub] spark issue #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Model wrapp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14392 **[Test build #63825 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63825/consoleFull)** for PR 14392 at commit [`cc708b5`](https://github.com/apache/spark/commit/cc708b549455ad1d850e86198a84060086d30386). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/14580 Can you explain `isnotnull(coalesce(b#227, c#238)) does not filter out NULL!!!`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/14359 Btw, to give back-of-the-envelope estimates, we can look at 2 numbers: (1) How many nodes will be split on each iteration? (2) How big is the forest which is serialized and sent to workers on each iteration?
For (1), here's an example:
* 1000 features, each with 50 bins -> 50 possible splits
* set maxMemoryInMB = 256 (default)
* regression => 3 Double values per possible split
* 256 * 10^6 / (1000 * 50 * 3 * 8) ≈ 213 nodes/iteration
This implies that for trees of depth > 8 or so, many iterations will only split nodes from 1 or 2 trees. I.e., we should avoid communicating most trees.
For (2), the forest can be pretty expensive to send.
* Each node:
  * leaf node: 5 Doubles
  * internal node: ~8 Doubles/references + Split
  * Split: O(# categories) or 2 values for continuous, say 3 Doubles on average
  * => say 8 Doubles/node on average
* 100 trees of depth 8 => 25600 nodes => 1.6MB
* 100 trees of depth 14 => 105MB
* I've heard of many cases of users wanting to fit 500-1000 trees and use trees of depth 18-20.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
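The estimates above as executable arithmetic, under the same assumed constants:
```scala
// (1) nodes split per iteration under the 256 MB aggregate budget
val bytesPerNode = 1000L * 50 * 3 * 8          // features * splits * Doubles * bytes ≈ 1.2 MB
val nodesPerIteration = 256L * 1000 * 1000 / bytesPerNode   // ≈ 213

// (2) serialized forest size at ~8 Doubles (64 bytes) per tree node
val forestDepth14Bytes = 100L * (1 << 14) * 64 // 100 trees of depth 14 ≈ 105 MB
```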
[GitHub] spark issue #14182: [SPARK-16444][SparkR]: Isotonic Regression wrapper in Sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14182 **[Test build #63824 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63824/consoleFull)** for PR 14182 at commit [`8844961`](https://github.com/apache/spark/commit/884496153f9aa512bc437c1c23361479b6b2bc7b). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14658: [WIP][SPARK-5928] Remote Shuffle Blocks cannot be more t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14658 **[Test build #63823 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63823/consoleFull)** for PR 14658 at commit [`443aa91`](https://github.com/apache/spark/commit/443aa91cfc2490be9733c78b7cd911f09bedfac6). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580
```Scala
val df12 = df1.join(df2, $"df1.a" === $"df2.a", "fullouter")
  .select(coalesce($"df1.b", $"df2.c").as("a"), $"df1.b", $"df2.c")
df12.join(df3, "a").explain(true)
```
This is an example to show that we should not eliminate the outer join, even if `isnotnull(coalesce(b#227, c#238))` contains attributes that are not in the join conditions. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14558: [SPARK-16508][SparkR] Fix warnings on undocumente...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14558#discussion_r74874081 --- Diff: R/pkg/R/SQLContext.R --- @@ -181,7 +181,7 @@ getDefaultSqlSource <- function() { #' @method createDataFrame default #' @note createDataFrame since 1.4.0 # TODO(davies): support sampling and infer type from NA -createDataFrame.default <- function(data, schema = NULL, samplingRatio = 1.0) { +createDataFrame.default <- function(data, schema = NULL) { --- End diff -- Oh yes... Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14658: [WIP][SPARK-5928] Remote Shuffle Blocks cannot be...
GitHub user witgo opened a pull request: https://github.com/apache/spark/pull/14658 [WIP][SPARK-5928] Remote Shuffle Blocks cannot be more than 2 GB
## What changes were proposed in this pull request?
Add class `ChunkFetchInputStream`, which has the following effects:
1. flow control [WIP]
2. reduced memory usage [WIP]
3. unlimited size [WIP]
## How was this patch tested?
WIP
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/witgo/spark SPARK-5928_Shuffle_Blocks_2G
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14658.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14658
commit 443aa91cfc2490be9733c78b7cd911f09bedfac6
Author: Guoqiang Li
Date: 2016-08-16T04:00:10Z
Remote Shuffle Blocks cannot be more than 2 GB
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14392: [SPARK-16446] [SparkR] [ML] Gaussian Mixture Mode...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/14392#discussion_r74873932
--- Diff: R/pkg/R/generics.R ---
@@ -1279,6 +1279,13 @@ setGeneric("spark.naiveBayes", function(data, formula, ...) { standardGeneric("s
 #' @export
 setGeneric("spark.survreg", function(data, formula, ...) { standardGeneric("spark.survreg") })
+#' @rdname spark.gaussianMixture
+#' @export
+setGeneric("spark.gaussianMixture",
+           function(data, formula, ...) {
+             standardGeneric("spark.gaussianMixture")
--- End diff --
It cannot fit on one line, since `lint-r` requires that lines be no longer than 100 characters. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14558: [SPARK-16508][SparkR] Fix warnings on undocumente...
Github user junyangq commented on a diff in the pull request: https://github.com/apache/spark/pull/14558#discussion_r74873867 --- Diff: R/pkg/R/mllib.R --- @@ -298,14 +304,15 @@ setMethod("summary", signature(object = "NaiveBayesModel"), #' Users can call \code{summary} to print a summary of the fitted model, \code{predict} to make #' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models. #' -#' @param data SparkDataFrame for training -#' @param formula A symbolic description of the model to be fitted. Currently only a few formula +#' @param data a SparkDataFrame for training. +#' @param formula a symbolic description of the model to be fitted. Currently only a few formula #'operators are supported, including '~', '.', ':', '+', and '-'. #'Note that the response variable of formula is empty in spark.kmeans. -#' @param k Number of centers -#' @param maxIter Maximum iteration number -#' @param initMode The initialization algorithm choosen to fit the model -#' @return \code{spark.kmeans} returns a fitted k-means model +#' @param ... additional argument(s) passed to the method. --- End diff -- Yeah agreed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14628: [SPARK-17050][ML][MLLib] Improve kmean rdd.aggregate to ...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/14628 @holdenk I think depth = 2 is enough to handle large RDDs, and a bigger depth may add cost. I'll append test results later. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
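For reference, a toy sketch of the parameter being discussed, using the standard `RDD.treeAggregate` API (values here are illustrative):
```scala
// With depth = 2, treeAggregate inserts one intermediate reduce level, so the
// driver merges roughly sqrt(numPartitions) partial results instead of one
// result per partition.
val rdd = sc.parallelize(1 to 1000000, numSlices = 1000)
val sum = rdd.treeAggregate(0L)(
  seqOp = (acc, x) => acc + x,
  combOp = _ + _,
  depth = 2)
```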
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14359 **[Test build #63822 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63822/consoleFull)** for PR 14359 at commit [`f79f77c`](https://github.com/apache/spark/commit/f79f77ce49aa797e8432b56fd2ad115540be67cf). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 None of us is right. : ( `isnotnull(coalesce(b#227, c#238))` does not filter out `NULL`!!! Thus, the right fix is to remove the second condition:
```Scala
filter.constraints.filter(_.isInstanceOf[IsNotNull])
  .exists(expr => join.left.outputSet.intersect(expr.references).nonEmpty)
```
and
```Scala
filter.constraints.filter(_.isInstanceOf[IsNotNull])
  .exists(expr => join.right.outputSet.intersect(expr.references).nonEmpty)
```
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14359: [SPARK-16719][ML] Random Forests should communicate fewe...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/14359 Sorry for the long delay; I've been swamped by other things for a while. Re-emerging... I switched to Stack and then realized Stack has been deprecated in Scala 2.11, so I reverted to the original NodeQueue. But I renamed NodeQueue to NodeStack to be a bit clearer. @hhbyyh Any luck testing this at scale? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14580: [SPARK-16991][SQL] Fix `EliminateOuterJoin` optimizer to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14580 I found the root cause. None of us is right. : ( --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14647: [WIP][Test only][DEMO][SPARK-6235]Address various 2G lim...
Github user witgo commented on the issue: https://github.com/apache/spark/pull/14647 @hvanhovell I will submit some small PRs and provide a higher-level description of them. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13758 You are right. I missed that `UnsafeArrayData` is a subclass of `ArrayData`. We can pass `UnsafeArrayData` to a projection. I have one question. When we directly generate `UnsafeArrayData` from a primitive array and copy it into an `InternalRow` (`serializefromobject_result`), the following two operations are required:
1. Copy from a primitive array to `UnsafeArrayData`
2. Copy from `UnsafeArrayData` into `InternalRow` at line 102
On the other hand, this PR requires the following one operation:
0. (No copy happens at line 086 since this PR just stores a reference to a primitive array in `GenericArrayData`)
1. Copy from a primitive array to `InternalRow` ([this PR](https://github.com/apache/spark/pull/13911) performs `Platform.copyMemory` without iteration)
Can we avoid the additional copy at step 2 when we directly generate `UnsafeArrayData` from a primitive array? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
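For concreteness, the direct-generation path under discussion would look roughly like this, assuming the `fromPrimitiveArray` helper that PR #13911 introduces:
```scala
import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData

// Sketch: build an UnsafeArrayData straight from a primitive array in one
// bulk Platform.copyMemory, rather than holding a reference in
// GenericArrayData and copying element by element inside the projection.
val arr = Array(1, 2, 3)
val unsafeArray = UnsafeArrayData.fromPrimitiveArray(arr)
```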
[GitHub] spark issue #14616: [SPARK-17034][SQL] adds expression UnresolvedOrdinal to ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14616 **[Test build #63821 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63821/consoleFull)** for PR 14616 at commit [`db84e25`](https://github.com/apache/spark/commit/db84e259749e6b339367fd42305f92a224407399). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14649 Also, if my understanding is correct, we pick up only a single file to read the footer (see [ParquetFileFormat.scala#L217-L225](https://github.com/apache/spark/blob/abff92bfdc7d4c9d2308794f0350561fe0ceb4dd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L217-L225)) unless we merge schemas. It seems that for this reason writing `_metadata` or `_common_metadata` was disabled (see https://issues.apache.org/jira/browse/SPARK-15719). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
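Roughly, the single-footer read referenced above, using the same parquet-mr calls that appear in this PR's diff (`conf` and `fileStatus` are assumed to be in scope):
```scala
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader

// Read one file's footer; without schema merging, Spark infers the schema
// from a single footer like this rather than touching every data file.
val footer = ParquetFileReader.readFooter(conf, fileStatus, ParquetMetadataConverter.NO_FILTER)
val parquetSchema = footer.getFileMetaData.getSchema
```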
[GitHub] spark pull request #14649: [SPARK-17059][SQL] Allow FileFormat to specify pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14649#discussion_r74872775
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -423,6 +425,54 @@ class ParquetFileFormat
       sqlContext.sessionState.newHadoopConf(),
       options)
   }
+
+  override def filterPartitions(
+      filters: Seq[Filter],
+      schema: StructType,
+      conf: Configuration,
+      allFiles: Seq[FileStatus],
+      root: Path,
+      partitions: Seq[Partition]): Seq[Partition] = {
+    // Read the "_metadata" file if available, contains all block headers. On S3 better to grab
+    // all of the footers in a batch rather than having to read every single file just to get its
+    // footer.
+    allFiles.find(_.getPath.getName == ParquetFileWriter.PARQUET_METADATA_FILE).map { stat =>
+      val metadata = ParquetFileReader.readFooter(conf, stat, ParquetMetadataConverter.NO_FILTER)
+      partitions.map { part =>
+        filterByMetadata(
+          filters,
+          schema,
+          conf,
+          root,
+          metadata,
+          part)
+      }.filterNot(_.files.isEmpty)
+    }.getOrElse(partitions)
+  }
+
+  private def filterByMetadata(
+      filters: Seq[Filter],
+      schema: StructType,
+      conf: Configuration,
+      root: Path,
+      metadata: ParquetMetadata,
+      partition: Partition): Partition = {
+    val blockMetadatas = metadata.getBlocks.asScala
+    val parquetSchema = metadata.getFileMetaData.getSchema
+    val conjunctiveFilter = filters
+      .flatMap(ParquetFilters.createFilter(schema, _))
+      .reduceOption(FilterApi.and)
+    conjunctiveFilter.map { conjunction =>
+      val filteredBlocks = RowGroupFilter.filterRowGroups(
--- End diff --
Do you mind if I ask a question, please? If my understanding is correct, Parquet already filters row groups in both the normal reader and the vectorized reader (https://github.com/apache/spark/pull/13701). Is this doing the same thing on the Spark side? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14649: [SPARK-17059][SQL] Allow FileFormat to specify pa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14649#discussion_r74872795
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -423,6 +425,54 @@ class ParquetFileFormat
       sqlContext.sessionState.newHadoopConf(),
       options)
   }
+
+  override def filterPartitions(
+      filters: Seq[Filter],
+      schema: StructType,
+      conf: Configuration,
+      allFiles: Seq[FileStatus],
+      root: Path,
+      partitions: Seq[Partition]): Seq[Partition] = {
+    // Read the "_metadata" file if available, contains all block headers. On S3 better to grab
+    // all of the footers in a batch rather than having to read every single file just to get its
+    // footer.
+    allFiles.find(_.getPath.getName == ParquetFileWriter.PARQUET_METADATA_FILE).map { stat =>
+      val metadata = ParquetFileReader.readFooter(conf, stat, ParquetMetadataConverter.NO_FILTER)
+      partitions.map { part =>
+        filterByMetadata(
+          filters,
+          schema,
+          conf,
+          root,
+          metadata,
+          part)
+      }.filterNot(_.files.isEmpty)
+    }.getOrElse(partitions)
+  }
+
+  private def filterByMetadata(
+      filters: Seq[Filter],
+      schema: StructType,
+      conf: Configuration,
+      root: Path,
+      metadata: ParquetMetadata,
+      partition: Partition): Partition = {
+    val blockMetadatas = metadata.getBlocks.asScala
+    val parquetSchema = metadata.getFileMetaData.getSchema
+    val conjunctiveFilter = filters
+      .flatMap(ParquetFilters.createFilter(schema, _))
+      .reduceOption(FilterApi.and)
+    conjunctiveFilter.map { conjunction =>
+      val filteredBlocks = RowGroupFilter.filterRowGroups(
--- End diff --
Also, doesn't this try to touch many files on the driver side? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14447: [SPARK-16445][MLlib][SparkR] Multilayer Perceptron Class...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14447 **[Test build #63820 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63820/consoleFull)** for PR 14447 at commit [`7c94e2b`](https://github.com/apache/spark/commit/7c94e2ba11655cbd9275793f6c069ab3ba844238). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org