[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r82450939 --- Diff: python/pyspark/sql/context.py --- @@ -202,6 +202,10 @@ def registerFunction(self, name, f, returnType=StringType()): """ self.sparkSession.catalog.registerFunction(name, f, returnType) +def registerJavaFunction(self, name, javaClassName, returnType): --- End diff -- +1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14963: [SPARK-16992][PYSPARK] Virtualenv for Pylint and pep8 in...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/14963 I think this could be a good change for allowing more developers to onboard with PySpark - is there any interest among the current PySpark/build-focused committers [ @davies @srowen @rxin ] in seeing this change (or something similar)?
[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15377 **[Test build #66506 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66506/consoleFull)** for PR 15377 at commit [`f1962e4`](https://github.com/apache/spark/commit/f1962e41981c80bb366a8a2060edadee59f95022). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #15393: [HOTFIX][BUILD] Do not use contains in Option in ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/15393#discussion_r82450501 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala --- @@ -70,7 +70,7 @@ class JdbcRelationProvider extends CreatableRelationProvider if (tableExists) { mode match { case SaveMode.Overwrite => -if (isTruncate && isCascadingTruncateTable(url).contains(false)) { +if (isTruncate && isCascadingTruncateTable(url).exists(_ == false)) { --- End diff -- Thanks!
[GitHub] spark pull request #15239: [SPARK-17665][SPARKR] Support options/mode all fo...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15239#discussion_r82450472 --- Diff: R/pkg/R/SQLContext.R --- @@ -341,11 +342,13 @@ setMethod("toDF", signature(x = "RDD"), #' @name read.json #' @method read.json default #' @note read.json since 1.6.0 -read.json.default <- function(path) { +read.json.default <- function(path, ...) { sparkSession <- getSparkSession() + options <- varargsToStrEnv(...) # Allow the user to have a more flexible definiton of the text file path paths <- as.list(suppressWarnings(normalizePath(path))) read <- callJMethod(sparkSession, "read") + read <- callJMethod(read, "options", options) --- End diff -- Yeap, let me try to organise the unresolved comments here and https://github.com/apache/spark/issues/15231 if there are any! Thank you.
[GitHub] spark issue #11601: [SPARK-13568] [ML] Create feature transformer to impute ...
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/11601 Thanks for the comments @MLnick @jkbradley @sethah I have sent an update according to the comments and changed `ImputerModel.surrogate` and the persistence format into DataFrame. As for the multiple columns support, do we need to assemble the imputed columns together into one output column, or should we support multiple output columns?
[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15377 Merged build finished. Test PASSed.
[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15377 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66507/ Test PASSed.
[GitHub] spark issue #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to register Jav...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/9766 Maybe @marmbrus could take a look if @davies is busy?
[GitHub] spark issue #15393: [HOTFIX][BUILD] Do not use contains in Option in JdbcRel...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15393 **[Test build #66521 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66521/consoleFull)** for PR 15393 at commit [`d052ab6`](https://github.com/apache/spark/commit/d052ab663e0b6c246070e7dd2d4728a9708c8ac9).
[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15377 **[Test build #66507 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66507/consoleFull)** for PR 15377 at commit [`f6557d6`](https://github.com/apache/spark/commit/f6557d6e2fd79cf778966b0665fdd493a0de0b0e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14653: [SPARK-10931][PYSPARK][ML] PySpark ML Models should cont...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/14653 Huh, I'm not sure why Jenkins isn't picking this up - @jkbradley or @davidnavas can you tell Jenkins this is ok to test again?
[GitHub] spark pull request #15393: [HOTFIX][BUILD] Do not use contains in Option in ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15393#discussion_r82449644 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala --- @@ -70,7 +70,7 @@ class JdbcRelationProvider extends CreatableRelationProvider if (tableExists) { mode match { case SaveMode.Overwrite => -if (isTruncate && isCascadingTruncateTable(url).contains(false)) { +if (isTruncate && isCascadingTruncateTable(url).exists(_ == false)) { --- End diff -- Oh, yes I think that looks nicer. Thanks.
[GitHub] spark issue #15355: [SPARK-17782][STREAMING][BUILD] Add Kafka 0.10 project t...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15355 Also cherry-picked this one into branch 2.0 since it's also helpful for 2.0 backport PRs.
[GitHub] spark pull request #15393: [HOTFIX][BUILD] Do not use contains in Option in ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/15393#discussion_r82449290 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala --- @@ -70,7 +70,7 @@ class JdbcRelationProvider extends CreatableRelationProvider if (tableExists) { mode match { case SaveMode.Overwrite => -if (isTruncate && isCascadingTruncateTable(url).contains(false)) { +if (isTruncate && isCascadingTruncateTable(url).exists(_ == false)) { --- End diff -- Is the original one better? I mean '== Some(false)'?
[GitHub] spark pull request #15393: [HOTFIX][BUILD] Do not use contains in Option in ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/15393#discussion_r82449361 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala --- @@ -70,7 +70,7 @@ class JdbcRelationProvider extends CreatableRelationProvider if (tableExists) { mode match { case SaveMode.Overwrite => -if (isTruncate && isCascadingTruncateTable(url).contains(false)) { +if (isTruncate && isCascadingTruncateTable(url).exists(_ == false)) { --- End diff -- The current one also looks good to me. :)
[GitHub] spark issue #15384: [SPARK-17346][SQL][Tests]Fix the flaky topic deletion in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15384 **[Test build #66520 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66520/consoleFull)** for PR 15384 at commit [`0addb26`](https://github.com/apache/spark/commit/0addb262bbfdf4a260f5a9c5686903a900743533).
[GitHub] spark issue #15384: [SPARK-17346][SQL][Tests]Fix the flaky topic deletion in...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15384 Since deleting a normal topic may time out as well: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/1763/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceStressSuite/_It_is_not_a_test_/ I just removed the topic deletion code. It's not necessary since we are going to shut down the Kafka cluster.
[GitHub] spark issue #15365: [SPARK-17157][SPARKR]: Add multiclass logistic regressio...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/15365 It would be great to get some feedback on the name `spark.logit`. What do folks think about it?
[GitHub] spark pull request #15365: [SPARK-17157][SPARKR]: Add multiclass logistic re...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15365#discussion_r82448237 --- Diff: R/pkg/R/mllib.R --- @@ -647,6 +654,195 @@ setMethod("predict", signature(object = "KMeansModel"), predict_internal(object, newData) }) +#' Logistic Regression Model +#' +#' Fits an logistic regression model against a Spark DataFrame. It supports "binomial": Binary logistic regression +#' with pivoting; "multinomial": Multinomial logistic (softmax) regression without pivoting, similar to glmnet. +#' Users can print, make predictions on the produced model and save the model to the input path. +#' +#' @param data SparkDataFrame for training +#' @param formula A symbolic description of the model to be fitted. Currently only a few formula +#'operators are supported, including '~', '.', ':', '+', and '-'. +#' @param regParam the regularization parameter. Default is 0.0. +#' @param elasticNetParam the ElasticNet mixing parameter. For alpha = 0, the penalty is an L2 penalty. +#'For alpha = 1, it is an L1 penalty. For 0 < alpha < 1, the penalty is a combination +#'of L1 and L2. Default is 0.0 which is an L2 penalty. +#' @param maxIter maximum iteration number. +#' @param tol convergence tolerance of iterations. +#' @param fitIntercept whether to fit an intercept term. Default is TRUE. +#' @param family the name of family which is a description of the label distribution to be used in the model. +#' Supported options: +#' - "auto": Automatically select the family based on the number of classes: +#' If numClasses == 1 || numClasses == 2, set to "binomial". +#' Else, set to "multinomial". +#' - "binomial": Binary logistic regression with pivoting. +#' - "multinomial": Multinomial logistic (softmax) regression without pivoting. +#' Default is "auto". +#' @param standardization whether to standardize the training features before fitting the model. 
The coefficients +#'of models will be always returned on the original scale, so it will be transparent for +#'users. Note that with/without standardization, the models should be always converged +#'to the same solution when no regularization is applied. Default is TRUE, same as glmnet. +#' @param threshold in binary classification, in range [0, 1]. If the estimated probability of class label 1 +#' is > threshold, then predict 1, else 0. A high threshold encourages the model to predict 0 +#' more often; a low threshold encourages the model to predict 1 more often. Note: Setting this with +#' threshold p is equivalent to setting thresholds (Array(1-p, p)). When threshold is set, any user-set +#' value for thresholds will be cleared. If both threshold and thresholds are set, then they must be +#' equivalent. Default is 0.5. +#' @param thresholds in multiclass (or binary) classification to adjust the probability of predicting each class. +#' Array must have length equal to the number of classes, with values > 0, excepting that at most one +#' value may be 0. The class with largest value p/t is predicted, where p is the original probability +#' of that class and t is the class's threshold. Note: When thresholds is set, any user-set +#' value for threshold will be cleared. If both threshold and thresholds are set, then they must be +#' equivalent. Default is NULL. +#' @param weightCol The weight column name. +#' @param aggregationDepth depth for treeAggregate (>= 2). If the dimensions of features or the number of partitions +#' are large, this param could be adjusted to a larger size. Default is 2. +#' @param ... additional arguments passed to the method. 
+#' @return \code{spark.logit} returns a fitted logistic regression model +#' @rdname spark.logit +#' @aliases spark.logit,SparkDataFrame,formula-method +#' @name spark.logit +#' @export +#' @examples +#' \dontrun{ +#' sparkR.session() +#' # binary logistic regression +#' label <- c(1.0, 1.0, 1.0, 0.0, 0.0) +#' feature <- c(1.1419053, 0.9194079, -0.9498666, -1.1069903, 0.2809776) +#' binary_data <- as.data.frame(cbind(label, feature)) +#' binary_df <- suppressWarnings(createDataFrame(binary_data)) --- End diff -- why is `suppressWarnings` needed here?
[GitHub] spark issue #15375: [SPARK-17790] Support for parallelizing R data.frame lar...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15375 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66517/ Test FAILed.
[GitHub] spark issue #15375: [SPARK-17790] Support for parallelizing R data.frame lar...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15375 **[Test build #66517 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66517/consoleFull)** for PR 15375 at commit [`6ea7fb3`](https://github.com/apache/spark/commit/6ea7fb3fd2c95c65ae557d8df2b954f25ef4ceca). * This patch **fails R style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `public class InterfaceStability ` * ` sealed trait ConsumerStrategy ` * ` case class SubscribeStrategy(topics: Seq[String], kafkaParams: ju.Map[String, Object])` * ` case class SubscribePatternStrategy(` * `case class KafkaSourceOffset(partitionToOffsets: Map[TopicPartition, Long]) extends Offset ` * `abstract class StreamExecutionThread(name: String) extends UninterruptibleThread(name)`
[GitHub] spark issue #15375: [SPARK-17790] Support for parallelizing R data.frame lar...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15375 Merged build finished. Test FAILed.
[GitHub] spark issue #15393: [HOTFIX][BUILD] Do not use contains in Option in JdbcRel...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15393 **[Test build #66519 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66519/consoleFull)** for PR 15393 at commit [`83df63b`](https://github.com/apache/spark/commit/83df63b1303ee41838b3524737ce53ee2ac5e17f).
[GitHub] spark issue #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelationProvid...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15263 Oh, thank you for pointing this out.
[GitHub] spark issue #15393: [HOTFIX][BUILD] Do not use contains in Option in JdbcRel...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15393 @yhuai @zsxwing Do you mind taking a look, please?
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15148 Related to the docs, some more comments defining terminology would be useful for non-experts: * OR-amplification * probing buckets * false positives/negatives (w.r.t. finding nearest neighbors) Thank you!
[GitHub] spark pull request #15393: [HOTFIX][BUILD] Do not use contains in Option in ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/15393 [HOTFIX][BUILD] Do not use contains in Option in JdbcRelationProvider ## What changes were proposed in this pull request? This PR proposes to fix the use of the `contains` API, which only exists from Scala 2.11. ## How was this patch tested? Manually checked the API in Scala - https://github.com/scala/scala/blob/2.10.x/src/library/scala/Option.scala#L218 You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark hotfix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15393.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15393 commit 83df63b1303ee41838b3524737ce53ee2ac5e17f Author: hyukjinkwon Date: 2016-10-07T18:40:04Z Hotfix do not use contains in Option
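For readers following this thread, a minimal Scala sketch of the 2.10-compatible forms discussed in the review (the values here are hypothetical; in the actual code `isCascadingTruncateTable` returns an `Option[Boolean]`):

```scala
object OptionContainsSketch {
  def main(args: Array[String]): Unit = {
    val cascades: Option[Boolean] = Some(false)

    // Scala 2.11+ only -- the call this hotfix removes:
    // cascades.contains(false)

    // 2.10-compatible alternatives from the review thread:
    val viaExists = cascades.exists(_ == false) // form used in the hotfix
    val viaSome   = cascades == Some(false)     // direct comparison

    // Both are true for Some(false), and false for None or Some(true).
    println(s"$viaExists $viaSome")
  }
}
```

All three forms ask the same question ("is the value present and equal to `false`?"); `exists` and the `Some(false)` comparison simply predate the 2.11 `contains` convenience method.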
[GitHub] spark issue #15375: [SPARK-17790] Support for parallelizing R data.frame lar...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15375 **[Test build #66517 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66517/consoleFull)** for PR 15375 at commit [`6ea7fb3`](https://github.com/apache/spark/commit/6ea7fb3fd2c95c65ae557d8df2b954f25ef4ceca).
[GitHub] spark issue #15366: [SPARK-17793] [Web UI] Sorting on the description on the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15366 **[Test build #66518 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66518/consoleFull)** for PR 15366 at commit [`29b007b`](https://github.com/apache/spark/commit/29b007b68a9b1633b4191ef93d379c01fda2f393).
[GitHub] spark pull request #15367: [SPARK-17346][SQL][test-maven]Add Kafka source fo...
Github user zsxwing closed the pull request at: https://github.com/apache/spark/pull/15367
[GitHub] spark issue #14561: [SPARK-16972][CORE] Move DriverEndpoint out of CoarseGra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14561 Can one of the admins verify this patch?
[GitHub] spark pull request #15375: [SPARK-17790] Support for parallelizing R data.fr...
Github user falaki commented on a diff in the pull request: https://github.com/apache/spark/pull/15375#discussion_r82446808 --- Diff: R/pkg/R/context.R --- @@ -123,19 +126,48 @@ parallelize <- function(sc, coll, numSlices = 1) { if (numSlices > length(coll)) numSlices <- length(coll) + sizeLimit <- as.numeric(sparkR.conf( + "spark.r.maxAllocationLimit", + toString(.Machine$integer.max / 2) # Default to a safe default: 200MB + )) + objectSize <- object.size(coll) + + # For large objects we make sure the size of each slice is also smaller than sizeLimit + numSlices <- max(numSlices, ceiling(objectSize / sizeLimit)) + sliceLen <- ceiling(length(coll) / numSlices) slices <- split(coll, rep(1: (numSlices + 1), each = sliceLen)[1:length(coll)]) # Serialize each slice: obtain a list of raws, or a list of lists (slices) of # 2-tuples of raws serializedSlices <- lapply(slices, serialize, connection = NULL) - jrdd <- callJStatic("org.apache.spark.api.r.RRDD", - "createRDDFromArray", sc, serializedSlices) + # The PRC backend cannot handle arguments larger than 2GB (INT_MAX) + # If serialized data is safely less than that threshold we send it over the PRC channel. + # Otherwise, we write it to a file and send the file name + if (objectSize < sizeLimit) { +jrdd <- callJStatic("org.apache.spark.api.r.RRDD", "createRDDFromArray", sc, serializedSlices) + } else { +fileName <- writeToTempFile(serializedSlices) +jrdd <- callJStatic( + "org.apache.spark.api.r.RRDD", "createRDDFromFile", sc, fileName, as.integer(numSlices)) +file.remove(fileName) --- End diff -- Good point. Done! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
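The re-slicing arithmetic under discussion can be sketched as follows. This is a hypothetical standalone Scala helper mirroring the R logic in `parallelize`, with illustrative names: the requested slice count is first capped at the collection length, then raised so that each slice's serialized size stays under the configured limit.

```scala
// Illustrative sketch (not the actual Spark code) of the slice-count
// arithmetic: cap the requested count at the collection length, then raise
// it so each slice's share of the serialized object stays under sizeLimit.
def adjustedNumSlices(requested: Int, collLength: Int,
                      objectSize: Long, sizeLimit: Long): Int = {
  val capped = math.min(requested, collLength)
  val neededBySize = math.ceil(objectSize.toDouble / sizeLimit).toInt
  math.max(capped, neededBySize)
}
```

For example, requesting 2 slices for a 500-byte object with a 200-byte limit yields 3 slices, since size, not the requested count, becomes the binding constraint.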
[GitHub] spark issue #15366: [SPARK-17793] [Web UI] Sorting on the description on the...
Github user ajbozarth commented on the issue: https://github.com/apache/spark/pull/15366 Jenkins, retest this please
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445466 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Model produced by [[MinHash]] + */ +@Experimental +@Since("2.1.0") +class MinHashModel private[ml] (override val uid: String, hashFunctions: Seq[Int => Long]) + extends LSHModel[MinHashModel] { + + @Since("2.1.0") + override protected[this] val hashFunction: Vector => Vector = { +elems: Vector => + require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") + Vectors.dense(hashFunctions.map( +func => elems.toSparse.indices.toList.map(func).min.toDouble + ).toArray) + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +val xSet = x.toSparse.indices.toSet +val ySet = y.toSparse.indices.toSet +1 - xSet.intersect(ySet).size.toDouble / xSet.union(ySet).size.toDouble --- End diff -- Check for invalid input where the denominator is 0. It looks like that cannot happen currently since hashFunction is always called first, but it'd be good to protect against future changes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
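The guard the review asks for can be sketched like this. It is a hypothetical standalone version of `keyDistance` operating directly on index sets, not the actual patch: it fails fast when the union is empty instead of silently returning `NaN` from a 0/0 division.

```scala
// Hypothetical sketch of MinHashModel.keyDistance with the requested guard:
// reject the degenerate case where both index sets are empty, since the
// Jaccard distance's denominator (the union size) would then be zero.
def jaccardDistance(xIndices: Set[Int], yIndices: Set[Int]): Double = {
  val unionSize = xIndices.union(yIndices).size
  require(unionSize > 0, "Jaccard distance is undefined for two empty sets.")
  1.0 - xIndices.intersect(yIndices).size.toDouble / unionSize
}
```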
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445905 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/LSHTest.scala --- @@ -0,0 +1,130 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.ml.linalg.Vector +import org.apache.spark.sql.Dataset +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.DataTypes + +private[ml] object LSHTest { + /** + * For any locality sensitive function h in a metric space, we meed to verify whether + * the following property is satisfied. + * + * There exist d1, d2, p1, p2, so that for any two elements e1 and e2, --- End diff -- d1,d2 -> dist1,dist2 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445623 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Model produced by [[MinHash]] + */ +@Experimental +@Since("2.1.0") +class MinHashModel private[ml] (override val uid: String, hashFunctions: Seq[Int => Long]) + extends LSHModel[MinHashModel] { + + @Since("2.1.0") + override protected[this] val hashFunction: Vector => Vector = { +elems: Vector => + require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") + Vectors.dense(hashFunctions.map( +func => elems.toSparse.indices.toList.map(func).min.toDouble + ).toArray) + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +val xSet = x.toSparse.indices.toSet +val ySet = y.toSparse.indices.toSet +1 - xSet.intersect(ySet).size.toDouble / xSet.union(ySet).size.toDouble + } + + @Since("2.1.0") + override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +// Since it's generated by hashing, it will be a pair of dense vectors. +x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min + } +} + +/** + * LSH class for Jaccard distance + * The input set should be represented in sparse vector form. For example, + *Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)]) + * means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5 + */ +@Experimental +@Since("2.1.0") +class MinHash private[ml] (override val uid: String) extends LSH[MinHashModel] { + + private[this] val prime = 2038074743 --- End diff -- How was this prime chosen? Some comment here for future developers would be helpful. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445705 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import breeze.linalg.normalize + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{BLAS, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{DoubleParam, Params, ParamValidators} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Params for [[RandomProjection]]. + */ +@Experimental --- End diff -- No need for Experimental or Since annotations for private trait. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445915 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/LSHTest.scala --- @@ -0,0 +1,130 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.ml.linalg.Vector +import org.apache.spark.sql.Dataset +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.DataTypes + +private[ml] object LSHTest { + /** + * For any locality sensitive function h in a metric space, we meed to verify whether + * the following property is satisfied. + * + * There exist d1, d2, p1, p2, so that for any two elements e1 and e2, + * If dist(e1, e2) >= dist1, then Pr{h(x) == h(y)} >= p1 + * If dist(e1, e2) <= dist2, then Pr{h(x) != h(y)} <= p2 --- End diff -- should be: "If dist(e1, e2) >= dist2, then Pr{h(x) == h(y)} <= p2" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445482 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Model produced by [[MinHash]] + */ +@Experimental +@Since("2.1.0") +class MinHashModel private[ml] (override val uid: String, hashFunctions: Seq[Int => Long]) + extends LSHModel[MinHashModel] { + + @Since("2.1.0") + override protected[this] val hashFunction: Vector => Vector = { +elems: Vector => + require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") + Vectors.dense(hashFunctions.map( +func => elems.toSparse.indices.toList.map(func).min.toDouble + ).toArray) + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +val xSet = x.toSparse.indices.toSet +val ySet = y.toSparse.indices.toSet +1 - xSet.intersect(ySet).size.toDouble / xSet.union(ySet).size.toDouble + } + + @Since("2.1.0") + override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +// Since it's generated by hashing, it will be a pair of dense vectors. +x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min + } +} + +/** + * LSH class for Jaccard distance + * The input set should be represented in sparse vector form. For example, --- End diff -- Clarify: The input can be dense or sparse, but it is more efficient if it is sparse. Also clarify that the non-zero indices matter but that all non-zero values are treated as binary "1" values. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445651 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Model produced by [[MinHash]] + */ +@Experimental +@Since("2.1.0") +class MinHashModel private[ml] (override val uid: String, hashFunctions: Seq[Int => Long]) + extends LSHModel[MinHashModel] { + + @Since("2.1.0") + override protected[this] val hashFunction: Vector => Vector = { +elems: Vector => + require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") + Vectors.dense(hashFunctions.map( +func => elems.toSparse.indices.toList.map(func).min.toDouble + ).toArray) + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +val xSet = x.toSparse.indices.toSet +val ySet = y.toSparse.indices.toSet +1 - xSet.intersect(ySet).size.toDouble / xSet.union(ySet).size.toDouble + } + + @Since("2.1.0") + override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +// Since it's generated by hashing, it will be a pair of dense vectors. +x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min + } +} + +/** + * LSH class for Jaccard distance + * The input set should be represented in sparse vector form. For example, + *Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)]) + * means there are 10 elements in the space. 
This set contains elem 2, elem 3 and elem 5 + */ +@Experimental +@Since("2.1.0") +class MinHash private[ml] (override val uid: String) extends LSH[MinHashModel] { + + private[this] val prime = 2038074743 + + @Since("2.1.0") + override def setInputCol(value: String): this.type = super.setInputCol(value) + + @Since("2.1.0") + override def setOutputCol(value: String): this.type = super.setOutputCol(value) + + @Since("2.1.0") + override def setOutputDim(value: Int): this.type = super.setOutputDim(value) + + private[this] lazy val randSeq: Seq[Int] = { +Seq.fill($(outputDim))(1 + Random.nextInt(prime - 1)).take($(outputDim)) --- End diff -- Why have ```.take(...)```? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
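The reviewer's question can be answered by checking the semantics directly: `Seq.fill(n)(expr)` already evaluates `expr` exactly `n` times and returns exactly `n` elements, so a trailing `.take(n)` on the result is a no-op.

```scala
// Seq.fill(n)(expr) returns exactly n elements, so .take(n) changes nothing.
val n = 5
val filled = Seq.fill(n)(scala.util.Random.nextInt(100))
assert(filled.length == n)
assert(filled.take(n) == filled)  // the redundant take is an identity here
```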
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445670 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Model produced by [[MinHash]] + */ +@Experimental +@Since("2.1.0") +class MinHashModel private[ml] (override val uid: String, hashFunctions: Seq[Int => Long]) + extends LSHModel[MinHashModel] { + + @Since("2.1.0") + override protected[this] val hashFunction: Vector => Vector = { +elems: Vector => + require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") + Vectors.dense(hashFunctions.map( +func => elems.toSparse.indices.toList.map(func).min.toDouble + ).toArray) + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +val xSet = x.toSparse.indices.toSet +val ySet = y.toSparse.indices.toSet +1 - xSet.intersect(ySet).size.toDouble / xSet.union(ySet).size.toDouble + } + + @Since("2.1.0") + override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +// Since it's generated by hashing, it will be a pair of dense vectors. +x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min + } +} + +/** + * LSH class for Jaccard distance + * The input set should be represented in sparse vector form. For example, + *Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)]) + * means there are 10 elements in the space. 
This set contains elem 2, elem 3 and elem 5 + */ +@Experimental +@Since("2.1.0") +class MinHash private[ml] (override val uid: String) extends LSH[MinHashModel] { + + private[this] val prime = 2038074743 + + @Since("2.1.0") + override def setInputCol(value: String): this.type = super.setInputCol(value) + + @Since("2.1.0") + override def setOutputCol(value: String): this.type = super.setOutputCol(value) + + @Since("2.1.0") + override def setOutputDim(value: Int): this.type = super.setOutputDim(value) + + private[this] lazy val randSeq: Seq[Int] = { +Seq.fill($(outputDim))(1 + Random.nextInt(prime - 1)).take($(outputDim)) + } + + @Since("2.1.0") + private[ml] def this() = { +this(Identifiable.randomUID("min hash")) + } + + @Since("2.1.0") + override protected[this] def createRawLSHModel(inputDim: Int): MinHashModel = { +val numEntry = inputDim * 2 --- End diff -- This could overflow. Use ```inputDim < prime / 2 + 1``` instead? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
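The overflow the review flags is easy to demonstrate. For a large `inputDim`, `inputDim * 2` wraps past `Int.MaxValue` to a negative number, so a `numEntry < prime` requirement passes exactly when it should fail; the suggested rearranged comparison avoids the multiplication entirely. The check functions below are illustrative, not the actual Spark code.

```scala
val prime = 2038074743  // the prime used in MinHash

// Original form: the multiplication can overflow Int before the comparison.
def unsafeFits(inputDim: Int): Boolean = inputDim * 2 < prime

// Reviewer's suggestion: compare without multiplying, so no overflow.
// For odd prime, inputDim * 2 < prime  <=>  inputDim < prime / 2 + 1.
def safeFits(inputDim: Int): Boolean = inputDim < prime / 2 + 1

// For inputDim = 1500000000, inputDim * 2 wraps to -1294967296, so the
// unsafe check wrongly accepts a dimension the safe check rejects.
val big = 1500000000
```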
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445698 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Model produced by [[MinHash]] + */ +@Experimental +@Since("2.1.0") +class MinHashModel private[ml] (override val uid: String, hashFunctions: Seq[Int => Long]) + extends LSHModel[MinHashModel] { + + @Since("2.1.0") + override protected[this] val hashFunction: Vector => Vector = { +elems: Vector => + require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") + Vectors.dense(hashFunctions.map( +func => elems.toSparse.indices.toList.map(func).min.toDouble + ).toArray) + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +val xSet = x.toSparse.indices.toSet +val ySet = y.toSparse.indices.toSet +1 - xSet.intersect(ySet).size.toDouble / xSet.union(ySet).size.toDouble + } + + @Since("2.1.0") + override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +// Since it's generated by hashing, it will be a pair of dense vectors. +x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min + } +} + +/** + * LSH class for Jaccard distance + * The input set should be represented in sparse vector form. For example, + *Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)]) + * means there are 10 elements in the space. 
This set contains elem 2, elem 3 and elem 5 + */ +@Experimental +@Since("2.1.0") +class MinHash private[ml] (override val uid: String) extends LSH[MinHashModel] { + + private[this] val prime = 2038074743 + + @Since("2.1.0") + override def setInputCol(value: String): this.type = super.setInputCol(value) + + @Since("2.1.0") + override def setOutputCol(value: String): this.type = super.setOutputCol(value) + + @Since("2.1.0") + override def setOutputDim(value: Int): this.type = super.setOutputDim(value) + + private[this] lazy val randSeq: Seq[Int] = { +Seq.fill($(outputDim))(1 + Random.nextInt(prime - 1)).take($(outputDim)) + } + + @Since("2.1.0") + private[ml] def this() = { +this(Identifiable.randomUID("min hash")) + } + + @Since("2.1.0") + override protected[this] def createRawLSHModel(inputDim: Int): MinHashModel = { +val numEntry = inputDim * 2 +require(numEntry < prime, "The input vector dimension is too large for MinHash to handle.") +val hashFunctions: Seq[Int => Long] = { + (0 until $(outputDim)).map { i: Int => +// Perfect Hash function, use 2n buckets to reduce collision. +elem: Int => (1 + elem) * randSeq(i).toLong % prime % numEntry + } +} +new MinHashModel(uid, hashFunctions) + } + + @Since("2.1.0") + override def transformSchema(schema: StructType): StructType = { +require(schema.apply($(inputCol)).dataType.sameType(new VectorUDT), --- End diff -- You can use ```SchemaUtils.checkColumnType``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail:
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445744 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import breeze.linalg.normalize + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{BLAS, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{DoubleParam, Params, ParamValidators} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Params for [[RandomProjection]]. 
+ */ +@Experimental +@Since("2.1.0") +private[ml] trait RandomProjectionParams extends Params { + @Since("2.1.0") + val bucketLength: DoubleParam = new DoubleParam(this, "bucketLength", +"the length of each hash bucket, a larger bucket lowers the false negative rate.", +ParamValidators.gt(0)) + + /** @group getParam */ + @Since("2.1.0") + final def getBucketLength: Double = $(bucketLength) +} + +/** + * Model produced by [[RandomProjection]] --- End diff -- For Experimental classes, begin the Scaladoc with a line with: ``` :: Experimental :: ``` (See e.g. MultilayerPerceptronClassifier) Also, add doc for randUnitVectors since it shows up in the API as member data: ``` @param randUnitVectors ... ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445909 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/LSHTest.scala --- @@ -0,0 +1,130 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.ml.linalg.Vector +import org.apache.spark.sql.Dataset +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.DataTypes + +private[ml] object LSHTest { + /** + * For any locality sensitive function h in a metric space, we meed to verify whether + * the following property is satisfied. + * + * There exist d1, d2, p1, p2, so that for any two elements e1 and e2, + * If dist(e1, e2) >= dist1, then Pr{h(x) == h(y)} >= p1 --- End diff -- ">=" should be "<=" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
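The property being tested (with jkbradley's sign fix applied) is the standard LSH definition: sufficiently far pairs must collide with probability at most p1, sufficiently close pairs must differ with probability at most p2. For MinHash over random permutations, the collision probability is exactly the Jaccard similarity, which a quick Monte Carlo check can confirm. This is a standalone Python sketch, not the LSHTest code:

```python
import random

def jaccard_sim(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def collision_rate(a, b, universe, trials=2000, seed=0):
    """Estimate Pr[min-hash(a) == min-hash(b)] over random permutations
    of the element universe. For MinHash this equals Jaccard similarity:
    the min-hashes agree iff the overall minimum of a | b lies in a & b."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        perm = list(range(universe))
        rng.shuffle(perm)
        if min(perm[i] for i in a) == min(perm[i] for i in b):
            hits += 1
    return hits / trials

a, b = {0, 1, 2, 3}, {2, 3, 4, 5}
sim = jaccard_sim(a, b)                 # 2/6
est = collision_rate(a, b, universe=6)  # close to sim
```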
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445506 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Model produced by [[MinHash]] + */ +@Experimental +@Since("2.1.0") +class MinHashModel private[ml] (override val uid: String, hashFunctions: Seq[Int => Long]) + extends LSHModel[MinHashModel] { + + @Since("2.1.0") + override protected[this] val hashFunction: Vector => Vector = { +elems: Vector => + require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") + Vectors.dense(hashFunctions.map( +func => elems.toSparse.indices.toList.map(func).min.toDouble + ).toArray) + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +val xSet = x.toSparse.indices.toSet +val ySet = y.toSparse.indices.toSet +1 - xSet.intersect(ySet).size.toDouble / xSet.union(ySet).size.toDouble + } + + @Since("2.1.0") + override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +// Since it's generated by hashing, it will be a pair of dense vectors. +x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min + } +} + +/** + * LSH class for Jaccard distance + * The input set should be represented in sparse vector form. For example, + *Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)]) --- End diff -- This will show up as code if you surround it with single back ticks: ``` `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
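The two distances in the MinHashModel diff above are easy to mirror directly. A hedged Python transcription (function names are mine): `keyDistance` is the Jaccard distance over the sparse indices, and `hashDistance` is the minimum per-entry gap between two signatures, matching the patch's use of OR-amplification:

```python
def key_distance(x_indices, y_indices):
    """Jaccard distance between two sparse binary vectors,
    mirroring MinHashModel.keyDistance."""
    xs, ys = set(x_indices), set(y_indices)
    return 1.0 - len(xs & ys) / len(xs | ys)

def hash_distance(sig_x, sig_y):
    """OR-amplified distance between two signatures: the minimum
    per-entry gap, mirroring MinHashModel.hashDistance."""
    return min(abs(a - b) for a, b in zip(sig_x, sig_y))

d = key_distance([2, 3, 5], [3, 5, 7])  # 1 - 2/4 = 0.5
```

With OR-amplification, two signatures count as "close" as soon as any single entry matches, which is why the minimum (rather than an average) over entries is the right aggregation here.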
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445715 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import breeze.linalg.normalize + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{BLAS, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{DoubleParam, Params, ParamValidators} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Params for [[RandomProjection]]. + */ +@Experimental +@Since("2.1.0") +private[ml] trait RandomProjectionParams extends Params { + @Since("2.1.0") --- End diff -- (But keep this annotation since bucketLength is made public in subclasses.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445919 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/LSHTest.scala --- @@ -0,0 +1,130 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.ml.linalg.Vector +import org.apache.spark.sql.Dataset +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.DataTypes + +private[ml] object LSHTest { + /** + * For any locality sensitive function h in a metric space, we meed to verify whether + * the following property is satisfied. + * + * There exist d1, d2, p1, p2, so that for any two elements e1 and e2, + * If dist(e1, e2) >= dist1, then Pr{h(x) == h(y)} >= p1 + * If dist(e1, e2) <= dist2, then Pr{h(x) != h(y)} <= p2 + * + * This is called locality sensitive property. This method checks the property on an + * existing dataset and calculate the probabilities. + * (https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Definition) + * + * @param dataset The dataset to verify the locality sensitive hashing property. 
+ * @param lsh The lsh instance to perform the hashing + * @param dist1 Distance threshold for false positive --- End diff -- Rename dist1,dist2 to distFP, distFN
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445897 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import breeze.linalg.normalize + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{BLAS, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{DoubleParam, Params, ParamValidators} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Params for [[RandomProjection]]. 
+ */ +@Experimental +@Since("2.1.0") +private[ml] trait RandomProjectionParams extends Params { + @Since("2.1.0") + val bucketLength: DoubleParam = new DoubleParam(this, "bucketLength", +"the length of each hash bucket, a larger bucket lowers the false negative rate.", +ParamValidators.gt(0)) + + /** @group getParam */ + @Since("2.1.0") + final def getBucketLength: Double = $(bucketLength) +} + +/** + * Model produced by [[RandomProjection]] + */ +@Experimental +@Since("2.1.0") +class RandomProjectionModel private[ml] ( +@Since("2.1.0") override val uid: String, +@Since("2.1.0") val randUnitVectors: Array[Vector]) + extends LSHModel[RandomProjectionModel] with RandomProjectionParams { + + @Since("2.1.0") + override protected[this] val hashFunction: (Vector) => Vector = { +key: Vector => { + val hashValues: Array[Double] = randUnitVectors.map({ +randUnitVector => Math.floor(BLAS.dot(key, randUnitVector) / $(bucketLength)) + }) + Vectors.dense(hashValues) +} + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +Math.sqrt(Vectors.sqdist(x, y)) + } + + @Since("2.1.0") + override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +// Since it's generated by hashing, it will be a pair of dense vectors. +x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min + } +} + +/** + * This [[RandomProjection]] implements Locality Sensitive Hashing functions with 2-stable + * distributions. + * + * References: + * Wang, Jingdong et al. "Hashing for similarity search: A survey." arXiv preprint + * arXiv:1408.2927 (2014). 
+ */ +@Experimental +@Since("2.1.0") +class RandomProjection private[ml] ( + @Since("2.1.0") override val uid: String) extends LSH[RandomProjectionModel] + with RandomProjectionParams { + + @Since("2.1.0") + override def setInputCol(value: String): this.type = super.setInputCol(value) + + @Since("2.1.0") + override def setOutputCol(value: String): this.type = super.setOutputCol(value) + + @Since("2.1.0") + override def setOutputDim(value: Int): this.type = super.setOutputDim(value) + + @Since("2.1.0") + private[ml] def this() = { +this(Identifiable.randomUID("random projection")) + } + + /** @group setParam */ + @Since("2.1.0") + def setBucketLength(value: Double): this.type = set(bucketLength, value) + + @Since("2.1.0") + override protected[this] def createRawLSHModel(inputDim: Int): RandomProjectionModel = { +val randUnitVectors: Array[Vector] = { + Array.fill($(outputDim)) { +val randArray = Array.fill(inputDim)(Random.nextGaussian()) +Vectors.fromBreeze(normalize(breeze.linalg.Vector(randArray))) + } +} +new RandomProjectionModel(uid, randUnitVectors) + } + + @Since("2.1.0") + override def transformSchema(schema: StructType): StructType = { +require(schema.apply($(inputCol)).dataType.sameType(new VectorUDT), ---
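The `createRawLSHModel` and `hashFunction` pair quoted above amounts to: draw Gaussian vectors, normalize them to unit length, then bucket each projection by `bucketLength`. A small pure-stdlib Python sketch of the same idea (names are mine, not Spark's):

```python
import math
import random

def make_random_projection(input_dim, output_dim, bucket_length, seed=7):
    """Sketch of RandomProjection: draw Gaussian vectors, normalize
    them to unit norm, and hash by bucketed projection."""
    rng = random.Random(seed)
    unit_vectors = []
    for _ in range(output_dim):
        v = [rng.gauss(0.0, 1.0) for _ in range(input_dim)]
        norm = math.sqrt(sum(c * c for c in v))
        unit_vectors.append([c / norm for c in v])

    def hash_fn(x):
        # One bucket id per random unit vector, as in the model's
        # hashFunction: floor(<x, v> / bucketLength).
        return [
            math.floor(sum(xi * vi for xi, vi in zip(x, v)) / bucket_length)
            for v in unit_vectors
        ]

    return hash_fn

h = make_random_projection(input_dim=3, output_dim=4, bucket_length=1.0)
```

Points whose projections fall into the same bucket on any of the `output_dim` vectors become collision candidates; a larger `bucket_length` coarsens the buckets and so lowers the false negative rate at the cost of more false positives.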
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445477 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Model produced by [[MinHash]] + */ +@Experimental +@Since("2.1.0") +class MinHashModel private[ml] (override val uid: String, hashFunctions: Seq[Int => Long]) + extends LSHModel[MinHashModel] { + + @Since("2.1.0") + override protected[this] val hashFunction: Vector => Vector = { +elems: Vector => + require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") + Vectors.dense(hashFunctions.map( +func => elems.toSparse.indices.toList.map(func).min.toDouble + ).toArray) + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +val xSet = x.toSparse.indices.toSet +val ySet = y.toSparse.indices.toSet +1 - xSet.intersect(ySet).size.toDouble / xSet.union(ySet).size.toDouble + } + + @Since("2.1.0") + override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +// Since it's generated by hashing, it will be a pair of dense vectors. +x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min + } +} + +/** + * LSH class for Jaccard distance --- End diff -- newline after this --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445726 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import breeze.linalg.normalize + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{BLAS, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{DoubleParam, Params, ParamValidators} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Params for [[RandomProjection]]. + */ +@Experimental +@Since("2.1.0") +private[ml] trait RandomProjectionParams extends Params { + @Since("2.1.0") + val bucketLength: DoubleParam = new DoubleParam(this, "bucketLength", --- End diff -- Add Scala doc for bucketLength. Some guidance on reasonable value ranges would be good. E.g., "If input vectors have unit norm, then " In doc, put bucketLength in ```@group param```. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445890 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import breeze.linalg.normalize + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{BLAS, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{DoubleParam, Params, ParamValidators} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Params for [[RandomProjection]]. 
+ */ +@Experimental +@Since("2.1.0") +private[ml] trait RandomProjectionParams extends Params { + @Since("2.1.0") + val bucketLength: DoubleParam = new DoubleParam(this, "bucketLength", +"the length of each hash bucket, a larger bucket lowers the false negative rate.", +ParamValidators.gt(0)) + + /** @group getParam */ + @Since("2.1.0") + final def getBucketLength: Double = $(bucketLength) +} + +/** + * Model produced by [[RandomProjection]] + */ +@Experimental +@Since("2.1.0") +class RandomProjectionModel private[ml] ( +@Since("2.1.0") override val uid: String, +@Since("2.1.0") val randUnitVectors: Array[Vector]) + extends LSHModel[RandomProjectionModel] with RandomProjectionParams { + + @Since("2.1.0") + override protected[this] val hashFunction: (Vector) => Vector = { +key: Vector => { + val hashValues: Array[Double] = randUnitVectors.map({ +randUnitVector => Math.floor(BLAS.dot(key, randUnitVector) / $(bucketLength)) + }) + Vectors.dense(hashValues) +} + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +Math.sqrt(Vectors.sqdist(x, y)) + } + + @Since("2.1.0") + override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +// Since it's generated by hashing, it will be a pair of dense vectors. +x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min + } +} + +/** + * This [[RandomProjection]] implements Locality Sensitive Hashing functions with 2-stable + * distributions. --- End diff -- Give some intuition for what "2-stable" means, or put it in details below since most users will not need to know about it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
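On the "2-stable" point raised in this comment: a distribution is p-stable when a linear combination of iid draws, sum(a_i * X_i), is distributed like (sum |a_i|^p)^(1/p) times a single draw. The Gaussian is 2-stable, so a projection <x - y, g> with iid N(0,1) entries g is N(0, ||x - y||^2): the spread of projections encodes the Euclidean distance, which is what makes the bucketing locality sensitive. A quick empirical check (illustrative Python, not Spark code):

```python
import math
import random

def projection_spread(x, y, trials=4000, seed=1):
    """Sample <x - y, g> for iid standard-normal g and return the
    empirical standard deviation; 2-stability says it should be
    close to the Euclidean distance ||x - y||."""
    rng = random.Random(seed)
    diff = [a - b for a, b in zip(x, y)]
    samples = []
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in diff]
        samples.append(sum(d * gi for d, gi in zip(diff, g)))
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / trials
    return math.sqrt(var)

x, y = [1.0, 2.0, 2.0], [1.0, 0.0, 0.0]
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))  # sqrt(8)
spread = projection_spread(x, y)  # close to dist
```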
[GitHub] spark issue #11601: [SPARK-13568] [ML] Create feature transformer to impute ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/11601 **[Test build #66516 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66516/consoleFull)** for PR 11601 at commit [`8744524`](https://github.com/apache/spark/commit/8744524e8da174316207cb4c33b425cbbd78f68e).
[GitHub] spark pull request #15239: [SPARK-17665][SPARKR] Support options/mode all fo...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15239
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/12004 # Packaging: 1. this addresses the problem that it's not always immediately obvious to people what they have to do to get, say, s3a working. Do you know precisely which version of the amazon-aws-SDK you need to have on your CP for a specific version of hadoop-aws.jar to avoid getting a linkage error? That's the problem maven handles for you. 1. with a new module, it lets downstream applications build with that support, knowing that issues related to dependency versions have been handled for them. # Documentation It has an overview of how to use this stuff, lists those dependencies, explains whether they can be used as a direct destination for work, why the Direct committer was taken away, etc. # Testing The tests make sure everything works. That's the packaging, the versioning of jackson, propagation of configuration options, failure handling, etc. Which offers: 1. Verifying the packaging. The initial role of the tests was to make sure the classpaths were coming in right, filesystems registering, etc. 1. Compliance testing of the object stores' client libraries: have they implemented the relevant APIs the way they are meant to, so that Spark can use them to list, read, write data. 1. Regression testing of the hadoop client libs: functionality and performance. This module, along with some Hive stuff, is the basis for benchmarking S3A performance improvements. 1. Regression testing of spark functionality/performance; highlighting places to tune stuff like directory listing operations. 1. Regression testing of cloud infras themselves. More relevant with Openstack than the others, as that's where you can go against nightly builds. 1. Cross object store benchmarking. 
Compare how long it takes the dataframe example to complete in Azure vs S3a, and crank up the debugging to see where the delays are (it's in s3 copy being way, way slower; looks like Azure is not actually copying bytes). 1. Integration testing. That is, rather than just do a minimal scalatest operation, you can use spark-submit to submit the work to a full cluster, so verify that the right JARs made it out, the cluster isn't running incompatible versions of the JVM and joda time, etc, etc. With this module, then, people get the option of building Spark with the JARs on the CP. But they also gain the ability to have Jenkins set up to make sure that everything works, all the time. It also provides the placeholder to add any code specific to object stores, like, perhaps some kind of committer. I don't have any plans there, but others might.
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445385 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,334 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.linalg.{Vector, VectorUDT} +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util.SchemaUtils +import org.apache.spark.sql._ +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[LSH]]. + */ +@Experimental +@Since("2.1.0") +private[ml] trait LSHParams extends HasInputCol with HasOutputCol { + /** + * Param for the dimension of LSH OR-amplification. + * + * In this implementation, we use LSH OR-amplification to reduce the false negative rate. The + * higher the dimension is, the lower the false negative rate. 
+ * @group param + */ + @Since("2.1.0") + final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension, where" + +"increasing dimensionality lowers the false negative rate", ParamValidators.gt(0)) + + /** @group getParam */ + @Since("2.1.0") + final def getOutputDim: Int = $(outputDim) + + // TODO: Decide about this default. It should probably depend on the particular LSH algorithm. + setDefault(outputDim -> 1, outputCol -> "lshFeatures") + + /** + * Transform the Schema for LSH + * @param schema The schema of the input dataset without outputCol + * @return A derived schema with outputCol added + */ + @Since("2.1.0") + protected[this] final def validateAndTransformSchema(schema: StructType): StructType = { +SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT) + } +} + +/** + * Model produced by [[LSH]]. + */ +@Experimental +@Since("2.1.0") +private[ml] abstract class LSHModel[T <: LSHModel[T]] extends Model[T] with LSHParams { + self: T => + + @Since("2.1.0") + override def copy(extra: ParamMap): T = defaultCopy(extra) + + /** + * The hash function of LSH, mapping a predefined KeyType to a Vector + * @return The mapping of LSH function. + */ + @Since("2.1.0") + protected[this] val hashFunction: Vector => Vector + + /** + * Calculate the distance between two different keys using the distance metric corresponding + * to the hashFunction + * @param x One of the point in the metric space + * @param y Another the point in the metric space + * @return The distance between x and y + */ + @Since("2.1.0") + protected[ml] def keyDistance(x: Vector, y: Vector): Double + + /** + * Calculate the distance between two different hash Vectors. 
+ * + * @param x One of the hash vector + * @param y Another hash vector + * @return The distance between hash vectors x and y + */ + @Since("2.1.0") + protected[ml] def hashDistance(x: Vector, y: Vector): Double + + @Since("2.1.0") + override def transform(dataset: Dataset[_]): DataFrame = { +transformSchema(dataset.schema, logging = true) +val transformUDF = udf(hashFunction, new VectorUDT) +dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol + } + + @Since("2.1.0") + override def transformSchema(schema: StructType): StructType = { +validateAndTransformSchema(schema) + } + + /** + * Given a large dataset and an item, approximately find at most k items which have the closest + * distance to the
[GitHub] spark pull request #15239: [SPARK-17665][SPARKR] Support options/mode all fo...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15239#discussion_r82445435 --- Diff: R/pkg/R/SQLContext.R --- @@ -341,11 +342,13 @@ setMethod("toDF", signature(x = "RDD"), #' @name read.json #' @method read.json default #' @note read.json since 1.6.0 -read.json.default <- function(path) { +read.json.default <- function(path, ...) { sparkSession <- getSparkSession() + options <- varargsToStrEnv(...) # Allow the user to have a more flexible definiton of the text file path paths <- as.list(suppressWarnings(normalizePath(path))) read <- callJMethod(sparkSession, "read") + read <- callJMethod(read, "options", options) --- End diff -- Ah, thank you for the very detailed analysis and tests. I think generally it would be great to match the scala/python behavior (but not only because to match it) for read to include all path(s). ``` > read.json(path = "hyukjin.json", path = "felix.json") Error in dispatchFunc("read.json(path)", x, ...) : argument "x" is missing, with no default ``` This is because of the parameter hack. ``` > read.df(path = "hyukjin.json", path = "felix.json", source = "json") Error in f(x, ...) : formal argument "path" matched by multiple actual arguments ``` I think read.df is somewhat unique in that its first parameter is named `path` - this is either helpful (if we don't want to support multiple paths like this) or bad (the user can't specify multiple paths) ``` > varargsToStrEnv("a", 1, 2, 3) ``` This case is somewhat dangerous - I think we end up passing a list of properties without names to the JVM side - it might be a good idea to check for `zero-length variable name` - perhaps you could open a JIRA on that? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
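The `varargsToStrEnv("a", 1, 2, 3)` hazard discussed above — unnamed arguments silently becoming nameless options handed to the JVM — can be guarded against by validating names before building the options map. A minimal Python sketch of that check (the helper name `varargs_to_options` is hypothetical, not part of SparkR or PySpark):

```python
def varargs_to_options(*args, **kwargs):
    """Collect reader options, rejecting unnamed values.

    Mirrors the suggested fix for SparkR's varargsToStrEnv: every
    option must carry a name before it is passed to the JVM side,
    otherwise it is rejected up front instead of failing mysteriously.
    """
    if args:
        raise ValueError(
            "unnamed arguments are not allowed as options: %r" % (args,))
    # Stringify values, since reader options travel as string key/value pairs.
    return {name: str(value) for name, value in kwargs.items()}
```

For example, `varargs_to_options(multiLine="true")` yields a clean options dict, while `varargs_to_options("a", 1, 2, 3)` raises immediately rather than shipping nameless properties across the bridge.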
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445404 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- (same diff hunk as in the previous comment)
[GitHub] spark issue #15366: [SPARK-17793] [Web UI] Sorting on the description on the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15366 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66508/ Test FAILed.
[GitHub] spark issue #15367: [SPARK-17346][SQL][test-maven]Add Kafka source for Struc...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15367 Thanks! I'm going to merge this one since the concern from @koeninger is addressed.
[GitHub] spark issue #15366: [SPARK-17793] [Web UI] Sorting on the description on the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15366 Merged build finished. Test FAILed.
[GitHub] spark issue #15366: [SPARK-17793] [Web UI] Sorting on the description on the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15366 **[Test build #66508 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66508/consoleFull)** for PR 15366 at commit [`29b007b`](https://github.com/apache/spark/commit/29b007b68a9b1633b4191ef93d379c01fda2f393). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #15389: [SPARK-17817][PySpark] PySpark RDD Repartitioning...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/15389#discussion_r82443930 --- Diff: python/pyspark/rdd.py --- @@ -2029,7 +2030,11 @@ def coalesce(self, numPartitions, shuffle=False): >>> sc.parallelize([1, 2, 3, 4, 5], 3).coalesce(1).glom().collect() [[1, 2, 3, 4, 5]] """ -jrdd = self._jrdd.coalesce(numPartitions, shuffle) +if shuffle: --- End diff -- Seems you could just call `repartition` here to avoid the code duplication or swap `repartition` to call `coalesce`.
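The delegation pattern suggested in this review — one of `repartition`/`coalesce` calling the other so the shuffle logic lives in one place — can be sketched with a toy stand-in class (PySpark's real `repartition` does delegate to `coalesce(numPartitions, shuffle=True)`; the `ToyRDD` class below is purely illustrative):

```python
class ToyRDD:
    """Minimal stand-in for an RDD, just enough to show the delegation
    pattern: repartition is coalesce with shuffle forced on, so the
    partition-changing logic is written exactly once."""

    def __init__(self, partitions):
        self.partitions = partitions

    def coalesce(self, num_partitions, shuffle=False):
        if shuffle:
            # With a shuffle, partitions can grow or shrink freely;
            # the expensive JVM-bridging path would go here.
            return ToyRDD(num_partitions)
        # Without a shuffle, partitions can only be merged, never grown.
        return ToyRDD(min(num_partitions, self.partitions))

    def repartition(self, num_partitions):
        # Single source of truth: delegate instead of duplicating.
        return self.coalesce(num_partitions, shuffle=True)
```

With this shape, `ToyRDD(3).repartition(5)` grows to 5 partitions via the shuffle path, while `ToyRDD(3).coalesce(5)` without shuffle stays at 3 — mirroring why the two code paths diverge and why centralizing them avoids drift.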
[GitHub] spark pull request #15389: [SPARK-17817][PySpark] PySpark RDD Repartitioning...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/15389#discussion_r82444378 --- Diff: python/pyspark/rdd.py --- @@ -2029,7 +2030,11 @@ def coalesce(self, numPartitions, shuffle=False): >>> sc.parallelize([1, 2, 3, 4, 5], 3).coalesce(1).glom().collect() [[1, 2, 3, 4, 5]] """ -jrdd = self._jrdd.coalesce(numPartitions, shuffle) +if shuffle: +data_java_rdd = self._to_java_object_rdd().coalesce(numPartitions, shuffle) +jrdd = self.ctx._jvm.SerDeUtil.javaToPython(data_java_rdd) --- End diff -- I'm not as familiar with this part as I should be, but do we have a good idea of how expensive this is? It might be good to do some quick benchmarking just to make sure that this change doesn't have any unintended side effects?
[GitHub] spark pull request #15354: [SPARK-17764][SQL] Add `to_json` supporting to co...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/15354#discussion_r82443073 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala --- @@ -343,4 +343,23 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper { null ) } + + test("to_json") { +val schema = StructType(StructField("a", IntegerType) :: Nil) +val struct = Literal.create(create_row(1), schema) +checkEvaluation( + StructToJson(Map.empty, struct), + """{"a":1}""" +) + } + + test("to_json - invalid type") { +val schema = StructType(StructField("a", CalendarIntervalType) :: Nil) --- End diff -- Hmm, I realize this is a little different than `from_json`, but it seems it would be better to eagerly throw an `AnalysisException` to say the schema contains an unsupported type. We know that ahead of time, and otherwise its kind of mysterious why all the values come out as `null`.
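The eager-validation idea in this review — fail fast when the schema contains a type the JSON writer cannot handle, instead of silently emitting `null`s — can be sketched with a simple recursive schema walk. This Python analog models a schema as nested `(name, type)` tuples; the function name, the schema encoding, and the unsupported-type set are all assumptions for illustration, not Spark APIs:

```python
# Types assumed unsupported by the JSON writer, for this sketch only.
UNSUPPORTED = {"calendarinterval"}

def validate_json_schema(schema, path="root"):
    """Recursively reject schemas the JSON writer cannot handle.

    Raising up front replaces the mysterious all-null output the
    review comment describes.  `schema` is a list of (name, dtype)
    pairs, where dtype is a type name or a nested list of fields.
    """
    for name, dtype in schema:
        field_path = "%s.%s" % (path, name)
        if isinstance(dtype, list):          # nested struct
            validate_json_schema(dtype, field_path)
        elif dtype.lower() in UNSUPPORTED:
            raise ValueError(
                "unsupported type %r at %s" % (dtype, field_path))
```

Because the walk carries the field path, the error pinpoints exactly which (possibly deeply nested) field is to blame — the diagnostic an eager `AnalysisException` would buy over nulls appearing at evaluation time.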
[GitHub] spark pull request #14087: [SPARK-16411][SQL][STREAMING] Add textFile to Str...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14087
[GitHub] spark issue #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelationProvid...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15263 Option.contains is only in Scala 2.11...
[GitHub] spark issue #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelationProvid...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/15263 Seems it breaks scala 2.10 compilation. Can you take a look? Thanks! ``` [error] /home/jenkins/workspace/spark-master-compile-sbt-scala-2.10/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala:73: value contains is not a member of Option[Boolean] [error] if (isTruncate && isCascadingTruncateTable(url).contains(false)) { [error] ^ [error] one error found [error] (sql/compile:compileIncremental) Compilation failed [error] Total time: 240 s, completed Oct 7, 2016 11:02:49 AM Build step 'Execute shell' marked build as failure Finished: FAILURE ```
[GitHub] spark issue #14087: [SPARK-16411][SQL][STREAMING] Add textFile to Structured...
Github user marmbrus commented on the issue: https://github.com/apache/spark/pull/14087 Thanks, merging to master.
[GitHub] spark issue #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14426 **[Test build #66515 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66515/consoleFull)** for PR 14426 at commit [`5290081`](https://github.com/apache/spark/commit/52900815e84d3f9a854168e0f0a1c2d555e96417).
[GitHub] spark issue #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14426 Hi, @gatorsmile. Could you review this PR when you have some time?
[GitHub] spark pull request #15377: [SPARK-17802] Improved caller context logging.
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/15377#discussion_r82441129 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2474,25 +2474,36 @@ private[spark] class CallerContext( val context = "SPARK_" + from + appIdStr + appAttemptIdStr + jobIdStr + stageIdStr + stageAttemptIdStr + taskIdStr + taskAttemptNumberStr + private var callerContextSupported: Boolean = true + /** * Set up the caller context [[context]] by invoking Hadoop CallerContext API of * [[org.apache.hadoop.ipc.CallerContext]], which was added in hadoop 2.8. */ def setCurrentContext(): Boolean = { -var succeed = false -try { - // scalastyle:off classforname - val callerContext = Class.forName("org.apache.hadoop.ipc.CallerContext") - val Builder = Class.forName("org.apache.hadoop.ipc.CallerContext$Builder") - // scalastyle:on classforname - val builderInst = Builder.getConstructor(classOf[String]).newInstance(context) - val hdfsContext = Builder.getMethod("build").invoke(builderInst) - callerContext.getMethod("setCurrent", callerContext).invoke(null, hdfsContext) - succeed = true -} catch { - case NonFatal(e) => logInfo("Fail to set Spark caller context", e) +if (!callerContextSupported) { + false +} else { + try { +// scalastyle:off classforname +val callerContext = Class.forName("org.apache.hadoop.ipc.CallerContext") +val builder = Class.forName("org.apache.hadoop.ipc.CallerContext$Builder") +// scalastyle:on classforname +val builderInst = builder.getConstructor(classOf[String]).newInstance(context) +val hdfsContext = builder.getMethod("build").invoke(builderInst) +callerContext.getMethod("setCurrent", callerContext).invoke(null, hdfsContext) +true + } catch { +case e: ClassNotFoundException => + logInfo( +s"Fail to set Spark caller context: requires Hadoop 2.8 or later: ${e.getMessage}") + callerContextSupported = false + false +case NonFatal(e) => + logWarning("Fail to set Spark caller context", e) --- End diff -- 
There are a few different deployment situations here.

1. You are running on Hadoop 2.8 and want the caller context. A failure here is something to mention.
2. You are running on Hadoop <= 2.7 and don't want or care about the context. Here another stack trace is just a distraction; if it isn't a support call, it gets added to the list of "error messages you learn to ignore". (This is my current state, BTW.)
3. You want the caller context but are running on an incompatible version of Hadoop. Again, logging the CNFE makes sense here.

The question is: do you need anything if the caller context is disabled? As far as I can see, you don't. And there's a Hadoop config option, `hadoop.caller.context.enabled` (default false), which controls exactly that. What about checking the config option: if it is set, go through the introspection work and report problems with stack traces; if unset, don't even bother with the introspection?
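The pattern discussed in this thread — gate the reflective lookup behind a config flag, attempt it once, and cache the outcome so later calls are cheap and quiet — can be sketched independently of Spark. In this Python sketch, `load_class` is a hypothetical stand-in for `Class.forName`, injected so the behavior can be exercised without Hadoop on the classpath:

```python
class CallerContextSetter:
    """Sketch of the review suggestion: consult the config flag first,
    try the reflective lookup at most once, and remember the result."""

    def __init__(self, conf, load_class):
        # Equivalent of hadoop.caller.context.enabled (default false).
        self.enabled = conf.get("hadoop.caller.context.enabled", "false") == "true"
        self.load_class = load_class
        self.supported = None   # unknown until the first attempt

    def set_current(self, context):
        if not self.enabled or self.supported is False:
            return False        # disabled, or already known to be missing
        try:
            cls = self.load_class("org.apache.hadoop.ipc.CallerContext")
        except ImportError:
            self.supported = False   # cache the miss: no retry, no re-log
            return False
        self.supported = True
        cls.set_current(context)
        return True
```

The key property is that an unsupported deployment pays the lookup cost (and produces any log noise) exactly once, and a deployment with the flag off never pays it at all.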
[GitHub] spark issue #15386: [SPARK-17808][PYSPARK] Upgraded version of Pyrolite to 4...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15386 Yeah I figured you'd need to run `./dev/test-dependencies.sh --replace-manifest` here
[GitHub] spark issue #15370: [SPARK-17417][Core] Fix # of partitions for Reliable RDD...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/15370 Ah you are right, sorry I totally missed that this is purely a sorting problem. I was thinking the %05d was causing an issue but it doesn't.
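Why the `%05d` padding itself is not the problem is easy to demonstrate: zero-padded part-file names sort the same lexicographically as numerically, whereas unpadded names do not. A quick illustration (the file names are hypothetical examples, not output from any particular job):

```python
# Zero-padded part-file names, as a %05d-style format would produce.
padded = ["part-%05d" % i for i in (0, 2, 10, 1)]
# Unpadded names for contrast.
plain = ["part-%d" % i for i in (0, 2, 10, 1)]

# Lexicographic order matches numeric order only when names are padded.
assert sorted(padded) == ["part-00000", "part-00001", "part-00002", "part-00010"]
# Without padding, "part-10" sorts before "part-2".
assert sorted(plain) == ["part-0", "part-1", "part-10", "part-2"]
```

So any misordering has to come from how the listing is sorted, not from the padded naming scheme — which is the conclusion the comment above reaches.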
[GitHub] spark issue #15370: [SPARK-17417][Core] Fix # of partitions for Reliable RDD...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15370 **[Test build #66514 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66514/consoleFull)** for PR 15370 at commit [`d9403f8`](https://github.com/apache/spark/commit/d9403f8d15fc4da03fb5b13609b1b837ef239c92).
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/12004 @steveloughran can you clarify what this does? It seems like just 5000 lines of examples and test cases? Users can already use these cloud stores by just adding the proper dependencies, can't they?
[GitHub] spark issue #15390: [SPARK-17806] [SQL] fix bug in join key rewritten in Has...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/15390 LGTM pending Jenkins.
[GitHub] spark issue #15370: [SPARK-17417][Core] Fix # of partitions for Reliable RDD...
Github user dhruve commented on the issue: https://github.com/apache/spark/pull/15370 retest this please
[GitHub] spark issue #15370: [SPARK-17417][Core] Fix # of partitions for Reliable RDD...
Github user dhruve commented on the issue: https://github.com/apache/spark/pull/15370 All tests passed. Error unrelated.
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12004 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66513/ Test FAILed.
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12004 **[Test build #66513 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66513/consoleFull)** for PR 12004 at commit [`a216aed`](https://github.com/apache/spark/commit/a216aed9a009c41a90131a8d6de04bb54c504a17). * This patch **fails build dependency tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12004 Merged build finished. Test FAILed.
[GitHub] spark issue #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelationProvid...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15263 Thanks! Merging to master!
[GitHub] spark pull request #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelatio...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15263
[GitHub] spark issue #15392: [SPARK-17830] Annotate spark.sql package with InterfaceS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15392 **[Test build #66511 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66511/consoleFull)** for PR 15392 at commit [`d298170`](https://github.com/apache/spark/commit/d298170948da3cc5a5ca50a9bd4ca03c9f6de145).
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12004 **[Test build #66513 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66513/consoleFull)** for PR 12004 at commit [`a216aed`](https://github.com/apache/spark/commit/a216aed9a009c41a90131a8d6de04bb54c504a17).
[GitHub] spark issue #15391: [MINOR][ML]:remove redundant comment in LogisticRegressi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15391 **[Test build #66512 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66512/consoleFull)** for PR 15391 at commit [`ad9de62`](https://github.com/apache/spark/commit/ad9de627b2c55bcb32435111e73d7e7e7f65df18).
[GitHub] spark issue #15392: [SPARK-17830] Annotate spark.sql package with InterfaceS...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15392 cc @marmbrus want to review this? It is pretty important to make sure the APIs are properly annotated.
[GitHub] spark pull request #15392: [SPARK-17830] Annotate spark.sql package with Inte...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/15392 [SPARK-17830] Annotate spark.sql package with InterfaceStability ## What changes were proposed in this pull request? This patch annotates the InterfaceStability level for top-level classes in the o.a.spark.sql and o.a.spark.sql.util packages, to experiment with this new annotation. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-17830 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15392.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15392 commit d298170948da3cc5a5ca50a9bd4ca03c9f6de145 Author: Reynold Xin Date: 2016-10-07T17:43:50Z [SPARK-17830] Annotate spark.sql package with InterfaceStability
[GitHub] spark pull request #15391: [MINOR][ML]:remove redundant comment in LogisticR...
GitHub user wangmiao1981 opened a pull request: https://github.com/apache/spark/pull/15391 [MINOR][ML]: remove redundant comment in LogisticRegression ## What changes were proposed in this pull request? While adding the R wrapper for LogisticRegression, I found one extra comment. It is minor, so I just removed it. ## How was this patch tested? Unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangmiao1981/spark mlordoc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15391.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15391 commit ad9de627b2c55bcb32435111e73d7e7e7f65df18 Author: wm...@hotmail.com Date: 2016-10-07T17:45:11Z remove redundant comment
[GitHub] spark issue #15381: [SPARK-17707] [WEBUI] Web UI prevents spark-submit appli...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15381 FYI, I fixed import conflicts in JettyUtils manually for branch-2.0.
[GitHub] spark pull request #15374: [SPARK-17800] Introduce InterfaceStability annota...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/15374#discussion_r82433404 --- Diff: common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java --- @@ -0,0 +1,49 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.annotation; + +import java.lang.annotation.Documented; + +/** + * Annotation to inform users of how much to rely on a particular package, + * class or method not changing over time. + */ +public class InterfaceStability { + + /** + * Stable APIs that retain source and binary compatibility within a major release. + * These interfaces can change from one major release to another major release + * (e.g. from 1.0 to 2.0). + */ + @Documented + public @interface Stable {}; + + /** + * APIs that are meant to evolve towards becoming stable APIs, but are not stable APIs yet. + * Evolving interfaces can change from one feature release to another release (i.e. 2.1 to 2.2). 
+ */ + @Documented + public @interface Evolving {}; --- End diff -- In my view "Evolving" would mean "this is sort of what the API will look like; there might be some adjustments in the next release, so you have to be willing to rev your code when the next release comes out". "Unstable" means "use at your own risk, the API may not even really work, and can be deleted without notice".
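For readers following the annotation discussion, here is a minimal, self-contained sketch of how such stability markers can be defined and applied. This is not the Spark source: `StabilityDemo` and `ExperimentalReader` are hypothetical names, and `@Retention(RUNTIME)` is added here only so the annotation is visible to reflection in the demo (the annotations in the patch rely on the default `CLASS` retention).

```java
import java.lang.annotation.Documented;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

public class StabilityDemo {
    // Stable: source- and binary-compatible within a major release.
    @Documented
    @Retention(RetentionPolicy.RUNTIME) // demo-only: real ones default to CLASS
    public @interface Stable {}

    // Evolving: usable now, but the shape may shift between feature releases.
    @Documented
    @Retention(RetentionPolicy.RUNTIME)
    public @interface Evolving {}

    // An API marked Evolving: callers should be ready to rev their code.
    @Evolving
    public static class ExperimentalReader {
        public String format() { return "parquet"; }
    }

    public static void main(String[] args) {
        boolean evolving =
            ExperimentalReader.class.isAnnotationPresent(Evolving.class);
        System.out.println("ExperimentalReader evolving? " + evolving);
    }
}
```

Because the annotations are `@Documented`, the stability level also shows up in generated javadoc, which is the main way users would see it.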
[GitHub] spark pull request #15381: [SPARK-17707] [WEBUI] Web UI prevents spark-submi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15381
[GitHub] spark issue #15381: [SPARK-17707] [WEBUI] Web UI prevents spark-submit appli...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15381 LGTM. Merging to master and 2.0.
[GitHub] spark issue #15390: [SPARK-17806] [SQL] fix bug in join key rewritten in Has...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15390 **[Test build #66510 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66510/consoleFull)** for PR 15390 at commit [`974fadb`](https://github.com/apache/spark/commit/974fadb748488c1375d8ba93b6b4b3290b0e02e2).
[GitHub] spark issue #15218: [SPARK-17637][Scheduler]Packed scheduling for Spark task...
Github user zhzhan commented on the issue: https://github.com/apache/spark/pull/15218 @mridulm Thanks for the comments. Your concern regarding locality is right. The patch does not change this behavior, which gives priority to the locality preference. But if multiple executors satisfy the locality restriction, the policy is applied. In our production pipeline, we do see a big gain with respect to reserved CPU resources when dynamic allocation is enabled. @kayousterhout Would you like to take a look and provide your comments?
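The policy described above (honor locality first, then pack tasks onto as few executors as possible so lightly-used ones drain and can be released by dynamic allocation) can be sketched roughly as follows. The names and data structures here are hypothetical and are not taken from the patch:

```java
import java.util.Map;
import java.util.Set;

public class PackedAssign {
    // Among executors meeting the task's locality preference, choose the one
    // with the fewest free cores (best-fit packing). Executors that never get
    // picked stay idle and can be reclaimed under dynamic allocation.
    static String pick(Map<String, Integer> freeCores, Set<String> localityOk) {
        String best = null;
        for (Map.Entry<String, Integer> e : freeCores.entrySet()) {
            if (!localityOk.contains(e.getKey()) || e.getValue() <= 0) continue;
            if (best == null || e.getValue() < freeCores.get(best)) {
                best = e.getKey();
            }
        }
        return best; // null if no executor satisfies locality
    }

    public static void main(String[] args) {
        Map<String, Integer> free = Map.of("exec1", 4, "exec2", 1, "exec3", 2);
        // exec3 is excluded by locality; among the rest, exec2 is fullest.
        System.out.println(pick(free, Set.of("exec1", "exec2")));
    }
}
```

The trade-off the reviewers raise is real: packing can hurt data locality and spread of load, which is why locality filtering happens before the packing step in this sketch.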
[GitHub] spark issue #15390: [SPARK-17806] [SQL] fix bug in join key rewritten in Has...
Github user davies commented on the issue: https://github.com/apache/spark/pull/15390 cc @rxin
[GitHub] spark pull request #15390: [SPARK-17806] [SQL] fix bug in join key rewritten...
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/15390 [SPARK-17806] [SQL] fix bug in join key rewritten in HashJoin ## What changes were proposed in this pull request? In HashJoin, we try to rewrite the join key as a Long to improve the performance of finding a match. The rewriting part was not well tested and has a bug that could cause wrong results when the join key contains at least three integral columns and the total length of the key exceeds 8 bytes. ## How was this patch tested? Added unit tests covering the rewriting with different numbers of columns and different data types. Manually tested the reported case and confirmed that this PR fixes the bug. You can merge this pull request into a Git repository by running: $ git pull https://github.com/davies/spark rewrite_key Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15390.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15390 commit 974fadb748488c1375d8ba93b6b4b3290b0e02e2 Author: Davies Liu Date: 2016-10-07T17:20:07Z fix join key rewritten
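As a rough illustration of the technique being fixed (this is not Spark's actual HashJoin code; the class, method names, and column widths are hypothetical), rewriting a multi-column integral join key as a single long works by shifting each column into one 64-bit word, and is only valid while the combined widths fit in 8 bytes:

```java
public class JoinKeyPack {
    // Pack an int column (4 bytes) and a short column (2 bytes) into one
    // long: 6 bytes total, so the packing is lossless. With three or more
    // integral columns whose widths sum past 8 bytes, high bits would be
    // shifted out and distinct keys could collide -- the class of bug the
    // PR describes.
    static long pack(int a, short b) {
        return (((long) a) << 16) | (b & 0xFFFFL);
    }

    static int unpackA(long key) { return (int) (key >> 16); }
    static short unpackB(long key) { return (short) (key & 0xFFFF); }

    public static void main(String[] args) {
        long key = pack(123456, (short) -7);
        // Round-trips exactly because 4 + 2 bytes fit within the long.
        System.out.println(unpackA(key) + "," + unpackB(key));
    }
}
```

Masking `b` with `0xFFFFL` before the OR matters: a negative short would otherwise sign-extend to 64 bits and clobber the higher column's bits.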
[GitHub] spark pull request #15374: [SPARK-17800] Introduce InterfaceStability annota...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15374
[GitHub] spark issue #15365: [SPARK-17157][SPARKR]: Add multiclass logistic regressio...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15365 cc @sethah @yanboliang