[GitHub] spark pull request #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to regis...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/9766#discussion_r82450939 --- Diff: python/pyspark/sql/context.py --- @@ -202,6 +202,10 @@ def registerFunction(self, name, f, returnType=StringType()): """ self.sparkSession.catalog.registerFunction(name, f, returnType) +def registerJavaFunction(self, name, javaClassName, returnType): --- End diff -- +1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14963: [SPARK-16992][PYSPARK] Virtualenv for Pylint and pep8 in...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/14963 I think this could be a good change for allowing more developers to onboard with PySpark - is there any interest among the current PySpark/build-focused committers [ @davies @srowen @rxin ] in seeing this change (or something similar)?
[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15377 **[Test build #66506 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66506/consoleFull)** for PR 15377 at commit [`f1962e4`](https://github.com/apache/spark/commit/f1962e41981c80bb366a8a2060edadee59f95022). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #15393: [HOTFIX][BUILD] Do not use contains in Option in ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/15393#discussion_r82450501 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala --- @@ -70,7 +70,7 @@ class JdbcRelationProvider extends CreatableRelationProvider if (tableExists) { mode match { case SaveMode.Overwrite => -if (isTruncate && isCascadingTruncateTable(url).contains(false)) { +if (isTruncate && isCascadingTruncateTable(url).exists(_ == false)) { --- End diff -- Thanks!
[GitHub] spark pull request #15239: [SPARK-17665][SPARKR] Support options/mode all fo...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15239#discussion_r82450472 --- Diff: R/pkg/R/SQLContext.R --- @@ -341,11 +342,13 @@ setMethod("toDF", signature(x = "RDD"), #' @name read.json #' @method read.json default #' @note read.json since 1.6.0 -read.json.default <- function(path) { +read.json.default <- function(path, ...) { sparkSession <- getSparkSession() + options <- varargsToStrEnv(...) # Allow the user to have a more flexible definiton of the text file path paths <- as.list(suppressWarnings(normalizePath(path))) read <- callJMethod(sparkSession, "read") + read <- callJMethod(read, "options", options) --- End diff -- Yeap, let me try to organise the unresolved comments here and https://github.com/apache/spark/issues/15231 if there are any! Thank you.
[GitHub] spark issue #11601: [SPARK-13568] [ML] Create feature transformer to impute ...
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/11601 Thanks for the comments @MLnick @jkbradley @sethah I have sent an update according to the comments and changed `ImputerModel.surrogate` and the persistence format into DataFrame. As for the multiple columns support, do we need to assemble the imputed columns together into one output column, or should we support multiple output columns?
[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15377 Merged build finished. Test PASSed.
[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15377 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66507/ Test PASSed.
[GitHub] spark issue #9766: [SPARK-11775][PYSPARK][SQL] Allow PySpark to register Jav...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/9766 Maybe @marmbrus could take a look if @davies is busy?
[GitHub] spark issue #15393: [HOTFIX][BUILD] Do not use contains in Option in JdbcRel...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15393 **[Test build #66521 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66521/consoleFull)** for PR 15393 at commit [`d052ab6`](https://github.com/apache/spark/commit/d052ab663e0b6c246070e7dd2d4728a9708c8ac9).
[GitHub] spark issue #15377: [SPARK-17802] Improved caller context logging.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15377 **[Test build #66507 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66507/consoleFull)** for PR 15377 at commit [`f6557d6`](https://github.com/apache/spark/commit/f6557d6e2fd79cf778966b0665fdd493a0de0b0e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14653: [SPARK-10931][PYSPARK][ML] PySpark ML Models should cont...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/14653 Huh, I'm not sure why Jenkins isn't picking this up - @jkbradley or @davidnavas can you tell Jenkins this is ok to test again?
[GitHub] spark pull request #15393: [HOTFIX][BUILD] Do not use contains in Option in ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15393#discussion_r82449644 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala --- @@ -70,7 +70,7 @@ class JdbcRelationProvider extends CreatableRelationProvider if (tableExists) { mode match { case SaveMode.Overwrite => -if (isTruncate && isCascadingTruncateTable(url).contains(false)) { +if (isTruncate && isCascadingTruncateTable(url).exists(_ == false)) { --- End diff -- Oh, yes I think that looks nicer. Thanks.
[GitHub] spark issue #15355: [SPARK-17782][STREAMING][BUILD] Add Kafka 0.10 project t...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15355 Also cherry-picked this one into branch 2.0 since it's also helpful for 2.0 backport PRs.
[GitHub] spark pull request #15393: [HOTFIX][BUILD] Do not use contains in Option in ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/15393#discussion_r82449290 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala --- @@ -70,7 +70,7 @@ class JdbcRelationProvider extends CreatableRelationProvider if (tableExists) { mode match { case SaveMode.Overwrite => -if (isTruncate && isCascadingTruncateTable(url).contains(false)) { +if (isTruncate && isCascadingTruncateTable(url).exists(_ == false)) { --- End diff -- Is the original one better? I mean '== Some(false)'?
[GitHub] spark pull request #15393: [HOTFIX][BUILD] Do not use contains in Option in ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/15393#discussion_r82449361 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala --- @@ -70,7 +70,7 @@ class JdbcRelationProvider extends CreatableRelationProvider if (tableExists) { mode match { case SaveMode.Overwrite => -if (isTruncate && isCascadingTruncateTable(url).contains(false)) { +if (isTruncate && isCascadingTruncateTable(url).exists(_ == false)) { --- End diff -- The current one also looks good to me. :)
[GitHub] spark issue #15384: [SPARK-17346][SQL][Tests]Fix the flaky topic deletion in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15384 **[Test build #66520 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66520/consoleFull)** for PR 15384 at commit [`0addb26`](https://github.com/apache/spark/commit/0addb262bbfdf4a260f5a9c5686903a900743533).
[GitHub] spark issue #15384: [SPARK-17346][SQL][Tests]Fix the flaky topic deletion in...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15384 Since deleting a normal topic may time out as well: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/1763/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceStressSuite/_It_is_not_a_test_/ I just removed the topic deletion code. It's not necessary since we are going to shut down the Kafka cluster.
[GitHub] spark issue #15365: [SPARK-17157][SPARKR]: Add multiclass logistic regressio...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/15365 It would be great to get some feedback on the name `spark.logit`. What do folks think about it?
[GitHub] spark pull request #15365: [SPARK-17157][SPARKR]: Add multiclass logistic re...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15365#discussion_r82448237 --- Diff: R/pkg/R/mllib.R --- @@ -647,6 +654,195 @@ setMethod("predict", signature(object = "KMeansModel"), predict_internal(object, newData) }) +#' Logistic Regression Model +#' +#' Fits an logistic regression model against a Spark DataFrame. It supports "binomial": Binary logistic regression +#' with pivoting; "multinomial": Multinomial logistic (softmax) regression without pivoting, similar to glmnet. +#' Users can print, make predictions on the produced model and save the model to the input path. +#' +#' @param data SparkDataFrame for training +#' @param formula A symbolic description of the model to be fitted. Currently only a few formula +#'operators are supported, including '~', '.', ':', '+', and '-'. +#' @param regParam the regularization parameter. Default is 0.0. +#' @param elasticNetParam the ElasticNet mixing parameter. For alpha = 0, the penalty is an L2 penalty. +#'For alpha = 1, it is an L1 penalty. For 0 < alpha < 1, the penalty is a combination +#'of L1 and L2. Default is 0.0 which is an L2 penalty. +#' @param maxIter maximum iteration number. +#' @param tol convergence tolerance of iterations. +#' @param fitIntercept whether to fit an intercept term. Default is TRUE. +#' @param family the name of family which is a description of the label distribution to be used in the model. +#' Supported options: +#' - "auto": Automatically select the family based on the number of classes: +#' If numClasses == 1 || numClasses == 2, set to "binomial". +#' Else, set to "multinomial". +#' - "binomial": Binary logistic regression with pivoting. +#' - "multinomial": Multinomial logistic (softmax) regression without pivoting. +#' Default is "auto". +#' @param standardization whether to standardize the training features before fitting the model. 
The coefficients +#'of models will be always returned on the original scale, so it will be transparent for +#'users. Note that with/without standardization, the models should be always converged +#'to the same solution when no regularization is applied. Default is TRUE, same as glmnet. +#' @param threshold in binary classification, in range [0, 1]. If the estimated probability of class label 1 +#' is > threshold, then predict 1, else 0. A high threshold encourages the model to predict 0 +#' more often; a low threshold encourages the model to predict 1 more often. Note: Setting this with +#' threshold p is equivalent to setting thresholds (Array(1-p, p)). When threshold is set, any user-set +#' value for thresholds will be cleared. If both threshold and thresholds are set, then they must be +#' equivalent. Default is 0.5. +#' @param thresholds in multiclass (or binary) classification to adjust the probability of predicting each class. +#' Array must have length equal to the number of classes, with values > 0, excepting that at most one +#' value may be 0. The class with largest value p/t is predicted, where p is the original probability +#' of that class and t is the class's threshold. Note: When thresholds is set, any user-set +#' value for threshold will be cleared. If both threshold and thresholds are set, then they must be +#' equivalent. Default is NULL. +#' @param weightCol The weight column name. +#' @param aggregationDepth depth for treeAggregate (>= 2). If the dimensions of features or the number of partitions +#' are large, this param could be adjusted to a larger size. Default is 2. +#' @param ... additional arguments passed to the method. 
+#' @return \code{spark.logit} returns a fitted logistic regression model +#' @rdname spark.logit +#' @aliases spark.logit,SparkDataFrame,formula-method +#' @name spark.logit +#' @export +#' @examples +#' \dontrun{ +#' sparkR.session() +#' # binary logistic regression +#' label <- c(1.0, 1.0, 1.0, 0.0, 0.0) +#' feature <- c(1.1419053, 0.9194079, -0.9498666, -1.1069903, 0.2809776) +#' binary_data <- as.data.frame(cbind(label, feature)) +#' binary_df <- suppressWarnings(createDataFrame(binary_data)) --- End diff -- why is `suppressWarnings` needed here?
[GitHub] spark issue #15375: [SPARK-17790] Support for parallelizing R data.frame lar...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15375 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66517/ Test FAILed.
[GitHub] spark issue #15375: [SPARK-17790] Support for parallelizing R data.frame lar...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15375 **[Test build #66517 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66517/consoleFull)** for PR 15375 at commit [`6ea7fb3`](https://github.com/apache/spark/commit/6ea7fb3fd2c95c65ae557d8df2b954f25ef4ceca). * This patch **fails R style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `public class InterfaceStability ` * ` sealed trait ConsumerStrategy ` * ` case class SubscribeStrategy(topics: Seq[String], kafkaParams: ju.Map[String, Object])` * ` case class SubscribePatternStrategy(` * `case class KafkaSourceOffset(partitionToOffsets: Map[TopicPartition, Long]) extends Offset ` * `abstract class StreamExecutionThread(name: String) extends UninterruptibleThread(name)`
[GitHub] spark issue #15375: [SPARK-17790] Support for parallelizing R data.frame lar...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15375 Merged build finished. Test FAILed.
[GitHub] spark issue #15393: [HOTFIX][BUILD] Do not use contains in Option in JdbcRel...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15393 **[Test build #66519 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66519/consoleFull)** for PR 15393 at commit [`83df63b`](https://github.com/apache/spark/commit/83df63b1303ee41838b3524737ce53ee2ac5e17f).
[GitHub] spark issue #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelationProvid...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15263 Oh, thank you for pointing this out.
[GitHub] spark issue #15393: [HOTFIX][BUILD] Do not use contains in Option in JdbcRel...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15393 @yhuai @zsxwing Do you mind taking a look, please?
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15148 Related to the docs, some more comments defining terminology would be useful for non-experts: * OR-amplification * probing buckets * false positives/negatives (w.r.t. finding nearest neighbors) Thank you!
[GitHub] spark pull request #15393: [HOTFIX][BUILD] Do not use contains in Option in ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/15393 [HOTFIX][BUILD] Do not use contains in Option in JdbcRelationProvider ## What changes were proposed in this pull request? This PR proposes to fix the use of the `contains` API, which only exists from Scala 2.11. ## How was this patch tested? Manually checked the API in Scala - https://github.com/scala/scala/blob/2.10.x/src/library/scala/Option.scala#L218 You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark hotfix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15393.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15393 commit 83df63b1303ee41838b3524737ce53ee2ac5e17f Author: hyukjinkwon Date: 2016-10-07T18:40:04Z Hotfix do not use contains in Option
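For readers following this thread, a minimal Scala sketch of the 2.10-compatible forms discussed in the review (the values here are hypothetical; in the actual code `isCascadingTruncateTable` returns an `Option[Boolean]`):

```scala
object OptionContainsSketch {
  def main(args: Array[String]): Unit = {
    val cascades: Option[Boolean] = Some(false)

    // Scala 2.11+ only -- the call this hotfix removes:
    // cascades.contains(false)

    // 2.10-compatible alternatives from the review thread:
    val viaExists = cascades.exists(_ == false) // form used in the hotfix
    val viaSome   = cascades == Some(false)     // direct comparison

    // Both are true for Some(false), and false for None or Some(true).
    println(s"$viaExists $viaSome")
  }
}
```

All three forms ask the same question ("is the value present and equal to `false`?"); `exists` and the `Some(false)` comparison simply predate the 2.11 `contains` convenience method.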
[GitHub] spark issue #15375: [SPARK-17790] Support for parallelizing R data.frame lar...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15375 **[Test build #66517 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66517/consoleFull)** for PR 15375 at commit [`6ea7fb3`](https://github.com/apache/spark/commit/6ea7fb3fd2c95c65ae557d8df2b954f25ef4ceca).
[GitHub] spark issue #15366: [SPARK-17793] [Web UI] Sorting on the description on the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15366 **[Test build #66518 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66518/consoleFull)** for PR 15366 at commit [`29b007b`](https://github.com/apache/spark/commit/29b007b68a9b1633b4191ef93d379c01fda2f393).
[GitHub] spark pull request #15367: [SPARK-17346][SQL][test-maven]Add Kafka source fo...
Github user zsxwing closed the pull request at: https://github.com/apache/spark/pull/15367
[GitHub] spark issue #14561: [SPARK-16972][CORE] Move DriverEndpoint out of CoarseGra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14561 Can one of the admins verify this patch?
[GitHub] spark pull request #15375: [SPARK-17790] Support for parallelizing R data.fr...
Github user falaki commented on a diff in the pull request: https://github.com/apache/spark/pull/15375#discussion_r82446808 --- Diff: R/pkg/R/context.R --- @@ -123,19 +126,48 @@ parallelize <- function(sc, coll, numSlices = 1) { if (numSlices > length(coll)) numSlices <- length(coll) + sizeLimit <- as.numeric(sparkR.conf( + "spark.r.maxAllocationLimit", + toString(.Machine$integer.max / 2) # Default to a safe default: 200MB + )) + objectSize <- object.size(coll) + + # For large objects we make sure the size of each slice is also smaller than sizeLimit + numSlices <- max(numSlices, ceiling(objectSize / sizeLimit)) + sliceLen <- ceiling(length(coll) / numSlices) slices <- split(coll, rep(1: (numSlices + 1), each = sliceLen)[1:length(coll)]) # Serialize each slice: obtain a list of raws, or a list of lists (slices) of # 2-tuples of raws serializedSlices <- lapply(slices, serialize, connection = NULL) - jrdd <- callJStatic("org.apache.spark.api.r.RRDD", - "createRDDFromArray", sc, serializedSlices) + # The PRC backend cannot handle arguments larger than 2GB (INT_MAX) + # If serialized data is safely less than that threshold we send it over the PRC channel. + # Otherwise, we write it to a file and send the file name + if (objectSize < sizeLimit) { +jrdd <- callJStatic("org.apache.spark.api.r.RRDD", "createRDDFromArray", sc, serializedSlices) + } else { +fileName <- writeToTempFile(serializedSlices) +jrdd <- callJStatic( + "org.apache.spark.api.r.RRDD", "createRDDFromFile", sc, fileName, as.integer(numSlices)) +file.remove(fileName) --- End diff -- Good point. Done! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
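The re-slicing arithmetic under discussion can be sketched as follows. This is a hypothetical standalone Scala helper mirroring the R logic in `parallelize`, with illustrative names: the requested slice count is first capped at the collection length, then raised so that each slice's serialized size stays under the configured limit.

```scala
// Illustrative sketch (not the actual Spark code) of the slice-count
// arithmetic: cap the requested count at the collection length, then raise
// it so each slice's share of the serialized object stays under sizeLimit.
def adjustedNumSlices(requested: Int, collLength: Int,
                      objectSize: Long, sizeLimit: Long): Int = {
  val capped = math.min(requested, collLength)
  val neededBySize = math.ceil(objectSize.toDouble / sizeLimit).toInt
  math.max(capped, neededBySize)
}
```

For example, requesting 2 slices for a 500-byte object with a 200-byte limit yields 3 slices, since size, not the requested count, becomes the binding constraint.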
[GitHub] spark issue #15366: [SPARK-17793] [Web UI] Sorting on the description on the...
Github user ajbozarth commented on the issue: https://github.com/apache/spark/pull/15366 Jenkins, retest this please
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445466 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Model produced by [[MinHash]] + */ +@Experimental +@Since("2.1.0") +class MinHashModel private[ml] (override val uid: String, hashFunctions: Seq[Int => Long]) + extends LSHModel[MinHashModel] { + + @Since("2.1.0") + override protected[this] val hashFunction: Vector => Vector = { +elems: Vector => + require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") + Vectors.dense(hashFunctions.map( +func => elems.toSparse.indices.toList.map(func).min.toDouble + ).toArray) + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +val xSet = x.toSparse.indices.toSet +val ySet = y.toSparse.indices.toSet +1 - xSet.intersect(ySet).size.toDouble / xSet.union(ySet).size.toDouble --- End diff -- Check for invalid input where the denominator is 0. It looks like that cannot happen currently since hashFunction is always called first, but it'd be good to protect against future changes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
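The guard the review asks for can be sketched like this. It is a hypothetical standalone version of `keyDistance` operating directly on index sets, not the actual patch: it fails fast when the union is empty instead of silently returning `NaN` from a 0/0 division.

```scala
// Hypothetical sketch of MinHashModel.keyDistance with the requested guard:
// reject the degenerate case where both index sets are empty, since the
// Jaccard distance's denominator (the union size) would then be zero.
def jaccardDistance(xIndices: Set[Int], yIndices: Set[Int]): Double = {
  val unionSize = xIndices.union(yIndices).size
  require(unionSize > 0, "Jaccard distance is undefined for two empty sets.")
  1.0 - xIndices.intersect(yIndices).size.toDouble / unionSize
}
```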
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445905 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/LSHTest.scala --- @@ -0,0 +1,130 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.ml.linalg.Vector +import org.apache.spark.sql.Dataset +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.DataTypes + +private[ml] object LSHTest { + /** + * For any locality sensitive function h in a metric space, we meed to verify whether + * the following property is satisfied. + * + * There exist d1, d2, p1, p2, so that for any two elements e1 and e2, --- End diff -- d1,d2 -> dist1,dist2 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445623 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Model produced by [[MinHash]] + */ +@Experimental +@Since("2.1.0") +class MinHashModel private[ml] (override val uid: String, hashFunctions: Seq[Int => Long]) + extends LSHModel[MinHashModel] { + + @Since("2.1.0") + override protected[this] val hashFunction: Vector => Vector = { +elems: Vector => + require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") + Vectors.dense(hashFunctions.map( +func => elems.toSparse.indices.toList.map(func).min.toDouble + ).toArray) + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +val xSet = x.toSparse.indices.toSet +val ySet = y.toSparse.indices.toSet +1 - xSet.intersect(ySet).size.toDouble / xSet.union(ySet).size.toDouble + } + + @Since("2.1.0") + override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +// Since it's generated by hashing, it will be a pair of dense vectors. +x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min + } +} + +/** + * LSH class for Jaccard distance + * The input set should be represented in sparse vector form. For example, + *Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)]) + * means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5 + */ +@Experimental +@Since("2.1.0") +class MinHash private[ml] (override val uid: String) extends LSH[MinHashModel] { + + private[this] val prime = 2038074743 --- End diff -- How was this prime chosen? Some comment here for future developers would be helpful. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445705 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import breeze.linalg.normalize + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{BLAS, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{DoubleParam, Params, ParamValidators} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Params for [[RandomProjection]]. + */ +@Experimental --- End diff -- No need for Experimental or Since annotations for private trait. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445915 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/LSHTest.scala --- @@ -0,0 +1,130 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.ml.linalg.Vector +import org.apache.spark.sql.Dataset +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.DataTypes + +private[ml] object LSHTest { + /** + * For any locality sensitive function h in a metric space, we meed to verify whether + * the following property is satisfied. + * + * There exist d1, d2, p1, p2, so that for any two elements e1 and e2, + * If dist(e1, e2) >= dist1, then Pr{h(x) == h(y)} >= p1 + * If dist(e1, e2) <= dist2, then Pr{h(x) != h(y)} <= p2 --- End diff -- should be: "If dist(e1, e2) >= dist2, then Pr{h(x) == h(y)} <= p2" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445482 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Model produced by [[MinHash]] + */ +@Experimental +@Since("2.1.0") +class MinHashModel private[ml] (override val uid: String, hashFunctions: Seq[Int => Long]) + extends LSHModel[MinHashModel] { + + @Since("2.1.0") + override protected[this] val hashFunction: Vector => Vector = { +elems: Vector => + require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") + Vectors.dense(hashFunctions.map( +func => elems.toSparse.indices.toList.map(func).min.toDouble + ).toArray) + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +val xSet = x.toSparse.indices.toSet +val ySet = y.toSparse.indices.toSet +1 - xSet.intersect(ySet).size.toDouble / xSet.union(ySet).size.toDouble + } + + @Since("2.1.0") + override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +// Since it's generated by hashing, it will be a pair of dense vectors. +x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min + } +} + +/** + * LSH class for Jaccard distance + * The input set should be represented in sparse vector form. For example, --- End diff -- Clarify: The input can be dense or sparse, but it is more efficient if it is sparse. Also clarify that the non-zero indices matter but that all non-zero values are treated as binary "1" values. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445651 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Model produced by [[MinHash]] + */ +@Experimental +@Since("2.1.0") +class MinHashModel private[ml] (override val uid: String, hashFunctions: Seq[Int => Long]) + extends LSHModel[MinHashModel] { + + @Since("2.1.0") + override protected[this] val hashFunction: Vector => Vector = { +elems: Vector => + require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") + Vectors.dense(hashFunctions.map( +func => elems.toSparse.indices.toList.map(func).min.toDouble + ).toArray) + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +val xSet = x.toSparse.indices.toSet +val ySet = y.toSparse.indices.toSet +1 - xSet.intersect(ySet).size.toDouble / xSet.union(ySet).size.toDouble + } + + @Since("2.1.0") + override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +// Since it's generated by hashing, it will be a pair of dense vectors. +x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min + } +} + +/** + * LSH class for Jaccard distance + * The input set should be represented in sparse vector form. For example, + *Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)]) + * means there are 10 elements in the space. 
This set contains elem 2, elem 3 and elem 5 + */ +@Experimental +@Since("2.1.0") +class MinHash private[ml] (override val uid: String) extends LSH[MinHashModel] { + + private[this] val prime = 2038074743 + + @Since("2.1.0") + override def setInputCol(value: String): this.type = super.setInputCol(value) + + @Since("2.1.0") + override def setOutputCol(value: String): this.type = super.setOutputCol(value) + + @Since("2.1.0") + override def setOutputDim(value: Int): this.type = super.setOutputDim(value) + + private[this] lazy val randSeq: Seq[Int] = { +Seq.fill($(outputDim))(1 + Random.nextInt(prime - 1)).take($(outputDim)) --- End diff -- Why have ```.take(...)```? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
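The reviewer's question can be answered by checking the semantics directly: `Seq.fill(n)(expr)` already evaluates `expr` exactly `n` times and returns exactly `n` elements, so a trailing `.take(n)` on the result is a no-op.

```scala
// Seq.fill(n)(expr) returns exactly n elements, so .take(n) changes nothing.
val n = 5
val filled = Seq.fill(n)(scala.util.Random.nextInt(100))
assert(filled.length == n)
assert(filled.take(n) == filled)  // the redundant take is an identity here
```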
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445670 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Model produced by [[MinHash]] + */ +@Experimental +@Since("2.1.0") +class MinHashModel private[ml] (override val uid: String, hashFunctions: Seq[Int => Long]) + extends LSHModel[MinHashModel] { + + @Since("2.1.0") + override protected[this] val hashFunction: Vector => Vector = { +elems: Vector => + require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") + Vectors.dense(hashFunctions.map( +func => elems.toSparse.indices.toList.map(func).min.toDouble + ).toArray) + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +val xSet = x.toSparse.indices.toSet +val ySet = y.toSparse.indices.toSet +1 - xSet.intersect(ySet).size.toDouble / xSet.union(ySet).size.toDouble + } + + @Since("2.1.0") + override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +// Since it's generated by hashing, it will be a pair of dense vectors. +x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min + } +} + +/** + * LSH class for Jaccard distance + * The input set should be represented in sparse vector form. For example, + *Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)]) + * means there are 10 elements in the space. 
This set contains elem 2, elem 3 and elem 5 + */ +@Experimental +@Since("2.1.0") +class MinHash private[ml] (override val uid: String) extends LSH[MinHashModel] { + + private[this] val prime = 2038074743 + + @Since("2.1.0") + override def setInputCol(value: String): this.type = super.setInputCol(value) + + @Since("2.1.0") + override def setOutputCol(value: String): this.type = super.setOutputCol(value) + + @Since("2.1.0") + override def setOutputDim(value: Int): this.type = super.setOutputDim(value) + + private[this] lazy val randSeq: Seq[Int] = { +Seq.fill($(outputDim))(1 + Random.nextInt(prime - 1)).take($(outputDim)) + } + + @Since("2.1.0") + private[ml] def this() = { +this(Identifiable.randomUID("min hash")) + } + + @Since("2.1.0") + override protected[this] def createRawLSHModel(inputDim: Int): MinHashModel = { +val numEntry = inputDim * 2 --- End diff -- This could overflow. Use ```inputDim < prime / 2 + 1``` instead? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
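The overflow the review flags is easy to demonstrate. For a large `inputDim`, `inputDim * 2` wraps past `Int.MaxValue` to a negative number, so a `numEntry < prime` requirement passes exactly when it should fail; the suggested rearranged comparison avoids the multiplication entirely. The check functions below are illustrative, not the actual Spark code.

```scala
val prime = 2038074743  // the prime used in MinHash

// Original form: the multiplication can overflow Int before the comparison.
def unsafeFits(inputDim: Int): Boolean = inputDim * 2 < prime

// Reviewer's suggestion: compare without multiplying, so no overflow.
// For odd prime, inputDim * 2 < prime  <=>  inputDim < prime / 2 + 1.
def safeFits(inputDim: Int): Boolean = inputDim < prime / 2 + 1

// For inputDim = 1500000000, inputDim * 2 wraps to -1294967296, so the
// unsafe check wrongly accepts a dimension the safe check rejects.
val big = 1500000000
```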
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445698 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Model produced by [[MinHash]] + */ +@Experimental +@Since("2.1.0") +class MinHashModel private[ml] (override val uid: String, hashFunctions: Seq[Int => Long]) + extends LSHModel[MinHashModel] { + + @Since("2.1.0") + override protected[this] val hashFunction: Vector => Vector = { +elems: Vector => + require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") + Vectors.dense(hashFunctions.map( +func => elems.toSparse.indices.toList.map(func).min.toDouble + ).toArray) + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +val xSet = x.toSparse.indices.toSet +val ySet = y.toSparse.indices.toSet +1 - xSet.intersect(ySet).size.toDouble / xSet.union(ySet).size.toDouble + } + + @Since("2.1.0") + override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +// Since it's generated by hashing, it will be a pair of dense vectors. +x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min + } +} + +/** + * LSH class for Jaccard distance + * The input set should be represented in sparse vector form. For example, + *Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)]) + * means there are 10 elements in the space. 
This set contains elem 2, elem 3 and elem 5 + */ +@Experimental +@Since("2.1.0") +class MinHash private[ml] (override val uid: String) extends LSH[MinHashModel] { + + private[this] val prime = 2038074743 + + @Since("2.1.0") + override def setInputCol(value: String): this.type = super.setInputCol(value) + + @Since("2.1.0") + override def setOutputCol(value: String): this.type = super.setOutputCol(value) + + @Since("2.1.0") + override def setOutputDim(value: Int): this.type = super.setOutputDim(value) + + private[this] lazy val randSeq: Seq[Int] = { +Seq.fill($(outputDim))(1 + Random.nextInt(prime - 1)).take($(outputDim)) + } + + @Since("2.1.0") + private[ml] def this() = { +this(Identifiable.randomUID("min hash")) + } + + @Since("2.1.0") + override protected[this] def createRawLSHModel(inputDim: Int): MinHashModel = { +val numEntry = inputDim * 2 +require(numEntry < prime, "The input vector dimension is too large for MinHash to handle.") +val hashFunctions: Seq[Int => Long] = { + (0 until $(outputDim)).map { i: Int => +// Perfect Hash function, use 2n buckets to reduce collision. +elem: Int => (1 + elem) * randSeq(i).toLong % prime % numEntry + } +} +new MinHashModel(uid, hashFunctions) + } + + @Since("2.1.0") + override def transformSchema(schema: StructType): StructType = { +require(schema.apply($(inputCol)).dataType.sameType(new VectorUDT), --- End diff -- You can use ```SchemaUtils.checkColumnType``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail:
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445744 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import breeze.linalg.normalize + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{BLAS, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{DoubleParam, Params, ParamValidators} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Params for [[RandomProjection]]. 
+ */ +@Experimental +@Since("2.1.0") +private[ml] trait RandomProjectionParams extends Params { + @Since("2.1.0") + val bucketLength: DoubleParam = new DoubleParam(this, "bucketLength", +"the length of each hash bucket, a larger bucket lowers the false negative rate.", +ParamValidators.gt(0)) + + /** @group getParam */ + @Since("2.1.0") + final def getBucketLength: Double = $(bucketLength) +} + +/** + * Model produced by [[RandomProjection]] --- End diff -- For Experimental classes, begin the Scaladoc with a line with: ``` :: Experimental :: ``` (See e.g. MultilayerPerceptronClassifier) Also, add doc for randUnitVectors since it shows up in the API as member data: ``` @param randUnitVectors ... ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445909 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/LSHTest.scala --- @@ -0,0 +1,130 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.ml.linalg.Vector +import org.apache.spark.sql.Dataset +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.DataTypes + +private[ml] object LSHTest { + /** + * For any locality sensitive function h in a metric space, we meed to verify whether + * the following property is satisfied. + * + * There exist d1, d2, p1, p2, so that for any two elements e1 and e2, + * If dist(e1, e2) >= dist1, then Pr{h(x) == h(y)} >= p1 --- End diff -- ">=" should be "<=" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
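The property being tested (with jkbradley's sign fix applied) is the standard LSH definition: sufficiently far pairs must collide with probability at most p1, sufficiently close pairs must differ with probability at most p2. For MinHash over random permutations, the collision probability is exactly the Jaccard similarity, which a quick Monte Carlo check can confirm. This is a standalone Python sketch, not the LSHTest code:

```python
import random

def jaccard_sim(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def collision_rate(a, b, universe, trials=2000, seed=0):
    """Estimate Pr[min-hash(a) == min-hash(b)] over random permutations
    of the element universe. For MinHash this equals Jaccard similarity:
    the min-hashes agree iff the overall minimum of a | b lies in a & b."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        perm = list(range(universe))
        rng.shuffle(perm)
        if min(perm[i] for i in a) == min(perm[i] for i in b):
            hits += 1
    return hits / trials

a, b = {0, 1, 2, 3}, {2, 3, 4, 5}
sim = jaccard_sim(a, b)                 # 2/6
est = collision_rate(a, b, universe=6)  # close to sim
```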
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445506 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Model produced by [[MinHash]] + */ +@Experimental +@Since("2.1.0") +class MinHashModel private[ml] (override val uid: String, hashFunctions: Seq[Int => Long]) + extends LSHModel[MinHashModel] { + + @Since("2.1.0") + override protected[this] val hashFunction: Vector => Vector = { +elems: Vector => + require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") + Vectors.dense(hashFunctions.map( +func => elems.toSparse.indices.toList.map(func).min.toDouble + ).toArray) + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +val xSet = x.toSparse.indices.toSet +val ySet = y.toSparse.indices.toSet +1 - xSet.intersect(ySet).size.toDouble / xSet.union(ySet).size.toDouble + } + + @Since("2.1.0") + override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +// Since it's generated by hashing, it will be a pair of dense vectors. +x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min + } +} + +/** + * LSH class for Jaccard distance + * The input set should be represented in sparse vector form. For example, + *Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)]) --- End diff -- This will show up as code if you surround it with single back ticks: ``` `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
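The two distances in the MinHashModel diff above are easy to mirror directly. A hedged Python transcription (function names are mine): `keyDistance` is the Jaccard distance over the sparse indices, and `hashDistance` is the minimum per-entry gap between two signatures, matching the patch's use of OR-amplification:

```python
def key_distance(x_indices, y_indices):
    """Jaccard distance between two sparse binary vectors,
    mirroring MinHashModel.keyDistance."""
    xs, ys = set(x_indices), set(y_indices)
    return 1.0 - len(xs & ys) / len(xs | ys)

def hash_distance(sig_x, sig_y):
    """OR-amplified distance between two signatures: the minimum
    per-entry gap, mirroring MinHashModel.hashDistance."""
    return min(abs(a - b) for a, b in zip(sig_x, sig_y))

d = key_distance([2, 3, 5], [3, 5, 7])  # 1 - 2/4 = 0.5
```

With OR-amplification, two signatures count as "close" as soon as any single entry matches, which is why the minimum (rather than an average) over entries is the right aggregation here.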
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445715 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import breeze.linalg.normalize + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{BLAS, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{DoubleParam, Params, ParamValidators} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Params for [[RandomProjection]]. + */ +@Experimental +@Since("2.1.0") +private[ml] trait RandomProjectionParams extends Params { + @Since("2.1.0") --- End diff -- (But keep this annotation since bucketLength is made public in subclasses.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445919 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/LSHTest.scala --- @@ -0,0 +1,130 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.ml.linalg.Vector +import org.apache.spark.sql.Dataset +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.DataTypes + +private[ml] object LSHTest { + /** + * For any locality sensitive function h in a metric space, we meed to verify whether + * the following property is satisfied. + * + * There exist d1, d2, p1, p2, so that for any two elements e1 and e2, + * If dist(e1, e2) >= dist1, then Pr{h(x) == h(y)} >= p1 + * If dist(e1, e2) <= dist2, then Pr{h(x) != h(y)} <= p2 + * + * This is called locality sensitive property. This method checks the property on an + * existing dataset and calculate the probabilities. + * (https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Definition) + * + * @param dataset The dataset to verify the locality sensitive hashing property. 
+ * @param lsh The lsh instance to perform the hashing + * @param dist1 Distance threshold for false positive --- End diff -- Rename dist1,dist2 to distFP, distFN
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445897 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import breeze.linalg.normalize + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{BLAS, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{DoubleParam, Params, ParamValidators} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Params for [[RandomProjection]]. 
+ */ +@Experimental +@Since("2.1.0") +private[ml] trait RandomProjectionParams extends Params { + @Since("2.1.0") + val bucketLength: DoubleParam = new DoubleParam(this, "bucketLength", +"the length of each hash bucket, a larger bucket lowers the false negative rate.", +ParamValidators.gt(0)) + + /** @group getParam */ + @Since("2.1.0") + final def getBucketLength: Double = $(bucketLength) +} + +/** + * Model produced by [[RandomProjection]] + */ +@Experimental +@Since("2.1.0") +class RandomProjectionModel private[ml] ( +@Since("2.1.0") override val uid: String, +@Since("2.1.0") val randUnitVectors: Array[Vector]) + extends LSHModel[RandomProjectionModel] with RandomProjectionParams { + + @Since("2.1.0") + override protected[this] val hashFunction: (Vector) => Vector = { +key: Vector => { + val hashValues: Array[Double] = randUnitVectors.map({ +randUnitVector => Math.floor(BLAS.dot(key, randUnitVector) / $(bucketLength)) + }) + Vectors.dense(hashValues) +} + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +Math.sqrt(Vectors.sqdist(x, y)) + } + + @Since("2.1.0") + override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +// Since it's generated by hashing, it will be a pair of dense vectors. +x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min + } +} + +/** + * This [[RandomProjection]] implements Locality Sensitive Hashing functions with 2-stable + * distributions. + * + * References: + * Wang, Jingdong et al. "Hashing for similarity search: A survey." arXiv preprint + * arXiv:1408.2927 (2014). 
+ */ +@Experimental +@Since("2.1.0") +class RandomProjection private[ml] ( + @Since("2.1.0") override val uid: String) extends LSH[RandomProjectionModel] + with RandomProjectionParams { + + @Since("2.1.0") + override def setInputCol(value: String): this.type = super.setInputCol(value) + + @Since("2.1.0") + override def setOutputCol(value: String): this.type = super.setOutputCol(value) + + @Since("2.1.0") + override def setOutputDim(value: Int): this.type = super.setOutputDim(value) + + @Since("2.1.0") + private[ml] def this() = { +this(Identifiable.randomUID("random projection")) + } + + /** @group setParam */ + @Since("2.1.0") + def setBucketLength(value: Double): this.type = set(bucketLength, value) + + @Since("2.1.0") + override protected[this] def createRawLSHModel(inputDim: Int): RandomProjectionModel = { +val randUnitVectors: Array[Vector] = { + Array.fill($(outputDim)) { +val randArray = Array.fill(inputDim)(Random.nextGaussian()) +Vectors.fromBreeze(normalize(breeze.linalg.Vector(randArray))) + } +} +new RandomProjectionModel(uid, randUnitVectors) + } + + @Since("2.1.0") + override def transformSchema(schema: StructType): StructType = { +require(schema.apply($(inputCol)).dataType.sameType(new VectorUDT), ---
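The `createRawLSHModel` and `hashFunction` pair quoted above amounts to: draw Gaussian vectors, normalize them to unit length, then bucket each projection by `bucketLength`. A small pure-stdlib Python sketch of the same idea (names are mine, not Spark's):

```python
import math
import random

def make_random_projection(input_dim, output_dim, bucket_length, seed=7):
    """Sketch of RandomProjection: draw Gaussian vectors, normalize
    them to unit norm, and hash by bucketed projection."""
    rng = random.Random(seed)
    unit_vectors = []
    for _ in range(output_dim):
        v = [rng.gauss(0.0, 1.0) for _ in range(input_dim)]
        norm = math.sqrt(sum(c * c for c in v))
        unit_vectors.append([c / norm for c in v])

    def hash_fn(x):
        # One bucket id per random unit vector, as in the model's
        # hashFunction: floor(<x, v> / bucketLength).
        return [
            math.floor(sum(xi * vi for xi, vi in zip(x, v)) / bucket_length)
            for v in unit_vectors
        ]

    return hash_fn

h = make_random_projection(input_dim=3, output_dim=4, bucket_length=1.0)
```

Points whose projections fall into the same bucket on any of the `output_dim` vectors become collision candidates; a larger `bucket_length` coarsens the buckets and so lowers the false negative rate at the cost of more false positives.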
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445477 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Model produced by [[MinHash]] + */ +@Experimental +@Since("2.1.0") +class MinHashModel private[ml] (override val uid: String, hashFunctions: Seq[Int => Long]) + extends LSHModel[MinHashModel] { + + @Since("2.1.0") + override protected[this] val hashFunction: Vector => Vector = { +elems: Vector => + require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") + Vectors.dense(hashFunctions.map( +func => elems.toSparse.indices.toList.map(func).min.toDouble + ).toArray) + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +val xSet = x.toSparse.indices.toSet +val ySet = y.toSparse.indices.toSet +1 - xSet.intersect(ySet).size.toDouble / xSet.union(ySet).size.toDouble + } + + @Since("2.1.0") + override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +// Since it's generated by hashing, it will be a pair of dense vectors. +x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min + } +} + +/** + * LSH class for Jaccard distance --- End diff -- newline after this --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445726 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import breeze.linalg.normalize + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{BLAS, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{DoubleParam, Params, ParamValidators} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Params for [[RandomProjection]]. + */ +@Experimental +@Since("2.1.0") +private[ml] trait RandomProjectionParams extends Params { + @Since("2.1.0") + val bucketLength: DoubleParam = new DoubleParam(this, "bucketLength", --- End diff -- Add Scala doc for bucketLength. Some guidance on reasonable value ranges would be good. E.g., "If input vectors have unit norm, then " In doc, put bucketLength in ```@group param```. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445890 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import breeze.linalg.normalize + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.linalg.{BLAS, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{DoubleParam, Params, ParamValidators} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.types.StructType + +/** + * Params for [[RandomProjection]]. 
+ */ +@Experimental +@Since("2.1.0") +private[ml] trait RandomProjectionParams extends Params { + @Since("2.1.0") + val bucketLength: DoubleParam = new DoubleParam(this, "bucketLength", +"the length of each hash bucket, a larger bucket lowers the false negative rate.", +ParamValidators.gt(0)) + + /** @group getParam */ + @Since("2.1.0") + final def getBucketLength: Double = $(bucketLength) +} + +/** + * Model produced by [[RandomProjection]] + */ +@Experimental +@Since("2.1.0") +class RandomProjectionModel private[ml] ( +@Since("2.1.0") override val uid: String, +@Since("2.1.0") val randUnitVectors: Array[Vector]) + extends LSHModel[RandomProjectionModel] with RandomProjectionParams { + + @Since("2.1.0") + override protected[this] val hashFunction: (Vector) => Vector = { +key: Vector => { + val hashValues: Array[Double] = randUnitVectors.map({ +randUnitVector => Math.floor(BLAS.dot(key, randUnitVector) / $(bucketLength)) + }) + Vectors.dense(hashValues) +} + } + + @Since("2.1.0") + override protected[ml] def keyDistance(x: Vector, y: Vector): Double = { +Math.sqrt(Vectors.sqdist(x, y)) + } + + @Since("2.1.0") + override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +// Since it's generated by hashing, it will be a pair of dense vectors. +x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min + } +} + +/** + * This [[RandomProjection]] implements Locality Sensitive Hashing functions with 2-stable + * distributions. --- End diff -- Give some intuition for what "2-stable" means, or put it in details below since most users will not need to know about it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
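On the "2-stable" point raised in this comment: a distribution is p-stable when a linear combination of iid draws, sum(a_i * X_i), is distributed like (sum |a_i|^p)^(1/p) times a single draw. The Gaussian is 2-stable, so a projection <x - y, g> with iid N(0,1) entries g is N(0, ||x - y||^2): the spread of projections encodes the Euclidean distance, which is what makes the bucketing locality sensitive. A quick empirical check (illustrative Python, not Spark code):

```python
import math
import random

def projection_spread(x, y, trials=4000, seed=1):
    """Sample <x - y, g> for iid standard-normal g and return the
    empirical standard deviation; 2-stability says it should be
    close to the Euclidean distance ||x - y||."""
    rng = random.Random(seed)
    diff = [a - b for a, b in zip(x, y)]
    samples = []
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in diff]
        samples.append(sum(d * gi for d, gi in zip(diff, g)))
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / trials
    return math.sqrt(var)

x, y = [1.0, 2.0, 2.0], [1.0, 0.0, 0.0]
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))  # sqrt(8)
spread = projection_spread(x, y)  # close to dist
```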
[GitHub] spark issue #11601: [SPARK-13568] [ML] Create feature transformer to impute ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/11601 **[Test build #66516 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66516/consoleFull)** for PR 11601 at commit [`8744524`](https://github.com/apache/spark/commit/8744524e8da174316207cb4c33b425cbbd78f68e).
[GitHub] spark pull request #15239: [SPARK-17665][SPARKR] Support options/mode all fo...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15239
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/12004 # Packaging: 1. this addresses the problem that it's not always immediately obvious to people what they have to do to get, say, s3a working. Do you know precisely which version of the amazon-aws-SDK you need to have on your CP for a specific version of hadoop-aws.jar to avoid getting a linkage error? That's the problem maven handles for you. 1. with a new module, it lets downstream applications build with that support, knowing that issues related to dependency versions have been handled for them. # Documentation It has an overview of how to use this stuff, lists those dependencies, explains whether they can be used as a direct destination for work, why the Direct committer was taken away, etc. # Testing The tests make sure everything works. That's the packaging, the versioning of jackson, propagation of configuration options, failure handling, etc. Which offers: 1. Verifying the packaging. The initial role of the tests was to make sure the classpaths were coming in right, filesystems registering, etc. 1. Compliance testing of the object stores' client libraries: have they implemented the relevant APIs the way they are meant to, so that Spark can use them to list, read, write data. 1. Regression testing of the hadoop client libs: functionality and performance. This module, along with some Hive stuff, is the basis for benchmarking S3A performance improvements. 1. Regression testing of spark functionality/performance; highlighting places to tune stuff like directory listing operations. 1. Regression testing of cloud infras themselves. More relevant with Openstack than the others, as that's where you can go against nightly builds. 1. Cross object store benchmarking. 
Compare how long it takes the dataframe example to complete in Azure vs S3a, and crank up the debugging to see where the delays are (it's in s3 copy being way, way slower; looks like Azure is not actually copying bytes). 1. Integration testing. That is, rather than just do a minimal scalatest operation, you can use spark-submit to submit the work to a full cluster, so verify that the right JARs made it out, the cluster isn't running incompatible versions of the JVM and joda time, etc, etc. With this module, then, people get the option of building Spark with the JARs on the CP. But they also gain the ability to have Jenkins set up to make sure that everything works, all the time. It also provides the placeholder to add any code specific to object stores, like, perhaps some kind of committer. I don't have any plans there, but others might.
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445385 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,334 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.linalg.{Vector, VectorUDT} +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util.SchemaUtils +import org.apache.spark.sql._ +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[LSH]]. + */ +@Experimental +@Since("2.1.0") +private[ml] trait LSHParams extends HasInputCol with HasOutputCol { + /** + * Param for the dimension of LSH OR-amplification. + * + * In this implementation, we use LSH OR-amplification to reduce the false negative rate. The + * higher the dimension is, the lower the false negative rate. 
+ * @group param + */ + @Since("2.1.0") + final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension, where" + +"increasing dimensionality lowers the false negative rate", ParamValidators.gt(0)) + + /** @group getParam */ + @Since("2.1.0") + final def getOutputDim: Int = $(outputDim) + + // TODO: Decide about this default. It should probably depend on the particular LSH algorithm. + setDefault(outputDim -> 1, outputCol -> "lshFeatures") + + /** + * Transform the Schema for LSH + * @param schema The schema of the input dataset without outputCol + * @return A derived schema with outputCol added + */ + @Since("2.1.0") + protected[this] final def validateAndTransformSchema(schema: StructType): StructType = { +SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT) + } +} + +/** + * Model produced by [[LSH]]. + */ +@Experimental +@Since("2.1.0") +private[ml] abstract class LSHModel[T <: LSHModel[T]] extends Model[T] with LSHParams { + self: T => + + @Since("2.1.0") + override def copy(extra: ParamMap): T = defaultCopy(extra) + + /** + * The hash function of LSH, mapping a predefined KeyType to a Vector + * @return The mapping of LSH function. + */ + @Since("2.1.0") + protected[this] val hashFunction: Vector => Vector + + /** + * Calculate the distance between two different keys using the distance metric corresponding + * to the hashFunction + * @param x One of the point in the metric space + * @param y Another the point in the metric space + * @return The distance between x and y + */ + @Since("2.1.0") + protected[ml] def keyDistance(x: Vector, y: Vector): Double + + /** + * Calculate the distance between two different hash Vectors. 
+ * + * @param x One of the hash vector + * @param y Another hash vector + * @return The distance between hash vectors x and y + */ + @Since("2.1.0") + protected[ml] def hashDistance(x: Vector, y: Vector): Double + + @Since("2.1.0") + override def transform(dataset: Dataset[_]): DataFrame = { +transformSchema(dataset.schema, logging = true) +val transformUDF = udf(hashFunction, new VectorUDT) +dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol + } + + @Since("2.1.0") + override def transformSchema(schema: StructType): StructType = { +validateAndTransformSchema(schema) + } + + /** + * Given a large dataset and an item, approximately find at most k items which have the closest + * distance to the
[GitHub] spark pull request #15239: [SPARK-17665][SPARKR] Support options/mode all fo...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15239#discussion_r82445435 --- Diff: R/pkg/R/SQLContext.R --- @@ -341,11 +342,13 @@ setMethod("toDF", signature(x = "RDD"), #' @name read.json #' @method read.json default #' @note read.json since 1.6.0 -read.json.default <- function(path) { +read.json.default <- function(path, ...) { sparkSession <- getSparkSession() + options <- varargsToStrEnv(...) # Allow the user to have a more flexible definiton of the text file path paths <- as.list(suppressWarnings(normalizePath(path))) read <- callJMethod(sparkSession, "read") + read <- callJMethod(read, "options", options) --- End diff -- Ah, thank you for the very detailed analysis and tests. I think generally it would be great to match the scala/python behavior (but not only because to match it) for read to include all path(s). ``` > read.json(path = "hyukjin.json", path = "felix.json") Error in dispatchFunc("read.json(path)", x, ...) : argument "x" is missing, with no default ``` This is because of the parameter hack. ``` > read.df(path = "hyukjin.json", path = "felix.json", source = "json") Error in f(x, ...) : formal argument "path" matched by multiple actual arguments ``` I think read.df is somewhat unique in that its first parameter is named `path` - this is either helpful (if we don't want to support multiple paths like this) or bad (the user can't specify multiple paths) ``` > varargsToStrEnv("a", 1, 2, 3) ``` This case is somewhat dangerous - I think we end up passing a list of properties without names to the JVM side - it might be a good idea to check for `zero-length variable name` - perhaps you could open a JIRA on that? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
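The `varargsToStrEnv("a", 1, 2, 3)` hazard discussed above — unnamed arguments silently becoming nameless options handed to the JVM — can be guarded against by validating names before building the options map. A minimal Python sketch of that check (the helper name `varargs_to_options` is hypothetical, not part of SparkR or PySpark):

```python
def varargs_to_options(*args, **kwargs):
    """Collect reader options, rejecting unnamed values.

    Mirrors the suggested fix for SparkR's varargsToStrEnv: every
    option must carry a name before it is passed to the JVM side,
    otherwise it is rejected up front instead of failing mysteriously.
    """
    if args:
        raise ValueError(
            "unnamed arguments are not allowed as options: %r" % (args,))
    # Stringify values, since reader options travel as string key/value pairs.
    return {name: str(value) for name, value in kwargs.items()}
```

For example, `varargs_to_options(multiLine="true")` yields a clean options dict, while `varargs_to_options("a", 1, 2, 3)` raises immediately rather than shipping nameless properties across the bridge.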
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r82445404 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- (same diff hunk as in the previous comment)
[GitHub] spark issue #15366: [SPARK-17793] [Web UI] Sorting on the description on the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15366 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66508/ Test FAILed.
[GitHub] spark issue #15367: [SPARK-17346][SQL][test-maven]Add Kafka source for Struc...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15367 Thanks! I'm going to merge this one since the concern from @koeninger is addressed.
[GitHub] spark issue #15366: [SPARK-17793] [Web UI] Sorting on the description on the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15366 Merged build finished. Test FAILed.
[GitHub] spark issue #15366: [SPARK-17793] [Web UI] Sorting on the description on the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15366 **[Test build #66508 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66508/consoleFull)** for PR 15366 at commit [`29b007b`](https://github.com/apache/spark/commit/29b007b68a9b1633b4191ef93d379c01fda2f393). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #15389: [SPARK-17817][PySpark] PySpark RDD Repartitioning...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/15389#discussion_r82443930 --- Diff: python/pyspark/rdd.py --- @@ -2029,7 +2030,11 @@ def coalesce(self, numPartitions, shuffle=False): >>> sc.parallelize([1, 2, 3, 4, 5], 3).coalesce(1).glom().collect() [[1, 2, 3, 4, 5]] """ -jrdd = self._jrdd.coalesce(numPartitions, shuffle) +if shuffle: --- End diff -- Seems you could just call `repartition` here to avoid the code duplication or swap `repartition` to call `coalesce`.
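The delegation pattern suggested in this review — one of `repartition`/`coalesce` calling the other so the shuffle logic lives in one place — can be sketched with a toy stand-in class (PySpark's real `repartition` does delegate to `coalesce(numPartitions, shuffle=True)`; the `ToyRDD` class below is purely illustrative):

```python
class ToyRDD:
    """Minimal stand-in for an RDD, just enough to show the delegation
    pattern: repartition is coalesce with shuffle forced on, so the
    partition-changing logic is written exactly once."""

    def __init__(self, partitions):
        self.partitions = partitions

    def coalesce(self, num_partitions, shuffle=False):
        if shuffle:
            # With a shuffle, partitions can grow or shrink freely;
            # the expensive JVM-bridging path would go here.
            return ToyRDD(num_partitions)
        # Without a shuffle, partitions can only be merged, never grown.
        return ToyRDD(min(num_partitions, self.partitions))

    def repartition(self, num_partitions):
        # Single source of truth: delegate instead of duplicating.
        return self.coalesce(num_partitions, shuffle=True)
```

With this shape, `ToyRDD(3).repartition(5)` grows to 5 partitions via the shuffle path, while `ToyRDD(3).coalesce(5)` without shuffle stays at 3 — mirroring why the two code paths diverge and why centralizing them avoids drift.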
[GitHub] spark pull request #15389: [SPARK-17817][PySpark] PySpark RDD Repartitioning...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/15389#discussion_r82444378 --- Diff: python/pyspark/rdd.py --- @@ -2029,7 +2030,11 @@ def coalesce(self, numPartitions, shuffle=False): >>> sc.parallelize([1, 2, 3, 4, 5], 3).coalesce(1).glom().collect() [[1, 2, 3, 4, 5]] """ -jrdd = self._jrdd.coalesce(numPartitions, shuffle) +if shuffle: +data_java_rdd = self._to_java_object_rdd().coalesce(numPartitions, shuffle) +jrdd = self.ctx._jvm.SerDeUtil.javaToPython(data_java_rdd) --- End diff -- I'm not as familiar with this part as I should be, but do we have a good idea of how expensive this is? It might be good to do some quick benchmarking just to make sure that this change doesn't have any unintended side effects?
[GitHub] spark pull request #15354: [SPARK-17764][SQL] Add `to_json` supporting to co...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/15354#discussion_r82443073 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala --- @@ -343,4 +343,23 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper { null ) } + + test("to_json") { +val schema = StructType(StructField("a", IntegerType) :: Nil) +val struct = Literal.create(create_row(1), schema) +checkEvaluation( + StructToJson(Map.empty, struct), + """{"a":1}""" +) + } + + test("to_json - invalid type") { +val schema = StructType(StructField("a", CalendarIntervalType) :: Nil) --- End diff -- Hmm, I realize this is a little different than `from_json`, but it seems it would be better to eagerly throw an `AnalysisException` to say the schema contains an unsupported type. We know that ahead of time, and otherwise its kind of mysterious why all the values come out as `null`.
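The eager-validation idea in this review — fail fast when the schema contains a type the JSON writer cannot handle, instead of silently emitting `null`s — can be sketched with a simple recursive schema walk. This Python analog models a schema as nested `(name, type)` tuples; the function name, the schema encoding, and the unsupported-type set are all assumptions for illustration, not Spark APIs:

```python
# Types assumed unsupported by the JSON writer, for this sketch only.
UNSUPPORTED = {"calendarinterval"}

def validate_json_schema(schema, path="root"):
    """Recursively reject schemas the JSON writer cannot handle.

    Raising up front replaces the mysterious all-null output the
    review comment describes.  `schema` is a list of (name, dtype)
    pairs, where dtype is a type name or a nested list of fields.
    """
    for name, dtype in schema:
        field_path = "%s.%s" % (path, name)
        if isinstance(dtype, list):          # nested struct
            validate_json_schema(dtype, field_path)
        elif dtype.lower() in UNSUPPORTED:
            raise ValueError(
                "unsupported type %r at %s" % (dtype, field_path))
```

Because the walk carries the field path, the error pinpoints exactly which (possibly deeply nested) field is to blame — the diagnostic an eager `AnalysisException` would buy over nulls appearing at evaluation time.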
[GitHub] spark pull request #14087: [SPARK-16411][SQL][STREAMING] Add textFile to Str...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14087
[GitHub] spark issue #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelationProvid...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15263 Option.contains is only in Scala 2.11...
[GitHub] spark issue #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelationProvid...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/15263 Seems it breaks scala 2.10 compilation. Can you take a look? Thanks! ``` [error] /home/jenkins/workspace/spark-master-compile-sbt-scala-2.10/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala:73: value contains is not a member of Option[Boolean] [error] if (isTruncate && isCascadingTruncateTable(url).contains(false)) { [error] ^ [error] one error found [error] (sql/compile:compileIncremental) Compilation failed [error] Total time: 240 s, completed Oct 7, 2016 11:02:49 AM Build step 'Execute shell' marked build as failure Finished: FAILURE ```
[GitHub] spark issue #14087: [SPARK-16411][SQL][STREAMING] Add textFile to Structured...
Github user marmbrus commented on the issue: https://github.com/apache/spark/pull/14087 Thanks, merging to master.
[GitHub] spark issue #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14426 **[Test build #66515 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66515/consoleFull)** for PR 14426 at commit [`5290081`](https://github.com/apache/spark/commit/52900815e84d3f9a854168e0f0a1c2d555e96417).
[GitHub] spark issue #14426: [SPARK-16475][SQL] Broadcast Hint for SQL Queries
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/14426 Hi, @gatorsmile. Could you review this PR when you have some time?
[GitHub] spark pull request #15377: [SPARK-17802] Improved caller context logging.
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/15377#discussion_r82441129 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2474,25 +2474,36 @@ private[spark] class CallerContext( val context = "SPARK_" + from + appIdStr + appAttemptIdStr + jobIdStr + stageIdStr + stageAttemptIdStr + taskIdStr + taskAttemptNumberStr + private var callerContextSupported: Boolean = true + /** * Set up the caller context [[context]] by invoking Hadoop CallerContext API of * [[org.apache.hadoop.ipc.CallerContext]], which was added in hadoop 2.8. */ def setCurrentContext(): Boolean = { -var succeed = false -try { - // scalastyle:off classforname - val callerContext = Class.forName("org.apache.hadoop.ipc.CallerContext") - val Builder = Class.forName("org.apache.hadoop.ipc.CallerContext$Builder") - // scalastyle:on classforname - val builderInst = Builder.getConstructor(classOf[String]).newInstance(context) - val hdfsContext = Builder.getMethod("build").invoke(builderInst) - callerContext.getMethod("setCurrent", callerContext).invoke(null, hdfsContext) - succeed = true -} catch { - case NonFatal(e) => logInfo("Fail to set Spark caller context", e) +if (!callerContextSupported) { + false +} else { + try { +// scalastyle:off classforname +val callerContext = Class.forName("org.apache.hadoop.ipc.CallerContext") +val builder = Class.forName("org.apache.hadoop.ipc.CallerContext$Builder") +// scalastyle:on classforname +val builderInst = builder.getConstructor(classOf[String]).newInstance(context) +val hdfsContext = builder.getMethod("build").invoke(builderInst) +callerContext.getMethod("setCurrent", callerContext).invoke(null, hdfsContext) +true + } catch { +case e: ClassNotFoundException => + logInfo( +s"Fail to set Spark caller context: requires Hadoop 2.8 or later: ${e.getMessage}") + callerContextSupported = false + false +case NonFatal(e) => + logWarning("Fail to set Spark caller context", e) --- End diff -- 
There are a few different deployment situations here.

1. You are running on Hadoop 2.8 and want the caller context. A failure here is something to mention.
2. You are running on Hadoop <= 2.7 and don't want or care about the context. Here another stack trace is just a distraction; if it isn't a support call, it gets added to the list of "error messages you learn to ignore". (This is my current state, BTW.)
3. You want the caller context but are running on an incompatible version of Hadoop. Again, logging the CNFE makes sense here.

The question is: do you need anything if the caller context is disabled? As far as I can see, you don't. And there's a Hadoop config option, `hadoop.caller.context.enabled` (default false), which controls exactly that. What about checking the config option: if it is set, go through the introspection work and report problems with stack traces; if unset, don't even bother with the introspection?
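The pattern discussed in this thread — gate the reflective lookup behind a config flag, attempt it once, and cache the outcome so later calls are cheap and quiet — can be sketched independently of Spark. In this Python sketch, `load_class` is a hypothetical stand-in for `Class.forName`, injected so the behavior can be exercised without Hadoop on the classpath:

```python
class CallerContextSetter:
    """Sketch of the review suggestion: consult the config flag first,
    try the reflective lookup at most once, and remember the result."""

    def __init__(self, conf, load_class):
        # Equivalent of hadoop.caller.context.enabled (default false).
        self.enabled = conf.get("hadoop.caller.context.enabled", "false") == "true"
        self.load_class = load_class
        self.supported = None   # unknown until the first attempt

    def set_current(self, context):
        if not self.enabled or self.supported is False:
            return False        # disabled, or already known to be missing
        try:
            cls = self.load_class("org.apache.hadoop.ipc.CallerContext")
        except ImportError:
            self.supported = False   # cache the miss: no retry, no re-log
            return False
        self.supported = True
        cls.set_current(context)
        return True
```

The key property is that an unsupported deployment pays the lookup cost (and produces any log noise) exactly once, and a deployment with the flag off never pays it at all.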
[GitHub] spark issue #15386: [SPARK-17808][PYSPARK] Upgraded version of Pyrolite to 4...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15386 Yeah I figured you'd need to run `./dev/test-dependencies.sh --replace-manifest` here
[GitHub] spark issue #15370: [SPARK-17417][Core] Fix # of partitions for Reliable RDD...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/15370 Ah you are right, sorry I totally missed that this is purely a sorting problem. I was thinking the %05d was causing an issue but it doesn't.
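Why the `%05d` padding itself is not the problem is easy to demonstrate: zero-padded part-file names sort the same lexicographically as numerically, whereas unpadded names do not. A quick illustration (the file names are hypothetical examples, not output from any particular job):

```python
# Zero-padded part-file names, as a %05d-style format would produce.
padded = ["part-%05d" % i for i in (0, 2, 10, 1)]
# Unpadded names for contrast.
plain = ["part-%d" % i for i in (0, 2, 10, 1)]

# Lexicographic order matches numeric order only when names are padded.
assert sorted(padded) == ["part-00000", "part-00001", "part-00002", "part-00010"]
# Without padding, "part-10" sorts before "part-2".
assert sorted(plain) == ["part-0", "part-1", "part-10", "part-2"]
```

So any misordering has to come from how the listing is sorted, not from the padded naming scheme — which is the conclusion the comment above reaches.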
[GitHub] spark issue #15370: [SPARK-17417][Core] Fix # of partitions for Reliable RDD...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15370 **[Test build #66514 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66514/consoleFull)** for PR 15370 at commit [`d9403f8`](https://github.com/apache/spark/commit/d9403f8d15fc4da03fb5b13609b1b837ef239c92).
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/12004 @steveloughran can you clarify what this does? It seems like just 5000 lines of examples and test cases? Users can already use these cloud stores by just adding the proper dependencies, can't they?
[GitHub] spark issue #15390: [SPARK-17806] [SQL] fix bug in join key rewritten in Has...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/15390 LGTM pending Jenkins.
[GitHub] spark issue #15370: [SPARK-17417][Core] Fix # of partitions for Reliable RDD...
Github user dhruve commented on the issue: https://github.com/apache/spark/pull/15370 retest this please
[GitHub] spark issue #15370: [SPARK-17417][Core] Fix # of partitions for Reliable RDD...
Github user dhruve commented on the issue: https://github.com/apache/spark/pull/15370 All tests passed. Error unrelated.
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12004 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66513/ Test FAILed.
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12004 **[Test build #66513 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66513/consoleFull)** for PR 12004 at commit [`a216aed`](https://github.com/apache/spark/commit/a216aed9a009c41a90131a8d6de04bb54c504a17). * This patch **fails build dependency tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12004 Merged build finished. Test FAILed.
[GitHub] spark issue #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelationProvid...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15263 Thanks! Merging to master!
[GitHub] spark pull request #15263: [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelatio...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15263
[GitHub] spark issue #15392: [SPARK-17830] Annotate spark.sql package with InterfaceS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15392 **[Test build #66511 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66511/consoleFull)** for PR 15392 at commit [`d298170`](https://github.com/apache/spark/commit/d298170948da3cc5a5ca50a9bd4ca03c9f6de145).
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12004 **[Test build #66513 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66513/consoleFull)** for PR 12004 at commit [`a216aed`](https://github.com/apache/spark/commit/a216aed9a009c41a90131a8d6de04bb54c504a17).
[GitHub] spark issue #15391: [MINOR][ML]:remove redundant comment in LogisticRegressi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15391 **[Test build #66512 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66512/consoleFull)** for PR 15391 at commit [`ad9de62`](https://github.com/apache/spark/commit/ad9de627b2c55bcb32435111e73d7e7e7f65df18).
[GitHub] spark issue #15392: [SPARK-17830] Annotate spark.sql package with InterfaceS...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15392 cc @marmbrus want to review this? It is pretty important to make sure the APIs are properly annotated.
[GitHub] spark pull request #15392: [SPARK-17830] Annotate spark.sql package with Inte...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/15392 [SPARK-17830] Annotate spark.sql package with InterfaceStability ## What changes were proposed in this pull request? This patch annotates the InterfaceStability level for top-level classes in the o.a.spark.sql and o.a.spark.sql.util packages, to experiment with this new annotation. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-17830 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15392.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15392 commit d298170948da3cc5a5ca50a9bd4ca03c9f6de145 Author: Reynold Xin Date: 2016-10-07T17:43:50Z [SPARK-17830] Annotate spark.sql package with InterfaceStability
[GitHub] spark pull request #15391: [MINOR][ML]:remove redundant comment in LogisticR...
GitHub user wangmiao1981 opened a pull request: https://github.com/apache/spark/pull/15391 [MINOR][ML]: remove redundant comment in LogisticRegression ## What changes were proposed in this pull request? While adding the R wrapper for LogisticRegression, I found one extra comment. It is minor, so I just removed it. ## How was this patch tested? Unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangmiao1981/spark mlordoc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15391.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15391 commit ad9de627b2c55bcb32435111e73d7e7e7f65df18 Author: wm...@hotmail.com Date: 2016-10-07T17:45:11Z remove redundant comment
[GitHub] spark issue #15381: [SPARK-17707] [WEBUI] Web UI prevents spark-submit appli...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15381 FYI, I fixed import conflicts in JettyUtils manually for branch-2.0.
[GitHub] spark pull request #15374: [SPARK-17800] Introduce InterfaceStability annota...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/15374#discussion_r82433404 --- Diff: common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java --- @@ -0,0 +1,49 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.annotation; + +import java.lang.annotation.Documented; + +/** + * Annotation to inform users of how much to rely on a particular package, + * class or method not changing over time. + */ +public class InterfaceStability { + + /** + * Stable APIs that retain source and binary compatibility within a major release. + * These interfaces can change from one major release to another major release + * (e.g. from 1.0 to 2.0). + */ + @Documented + public @interface Stable {}; + + /** + * APIs that are meant to evolve towards becoming stable APIs, but are not stable APIs yet. + * Evolving interfaces can change from one feature release to another release (i.e. 2.1 to 2.2). 
+ */ + @Documented + public @interface Evolving {}; --- End diff -- In my view "Evolving" would mean "this is sort of what the API will look like; there might be some adjustments in the next release, so you have to be willing to rev your code when the next release comes out". "Unstable" means "use at your own risk, the API may not even really work, and can be deleted without notice".
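For readers following the annotation discussion, here is a minimal, self-contained sketch of how such stability markers can be defined and applied. This is not the Spark source: `StabilityDemo` and `ExperimentalReader` are hypothetical names, and `@Retention(RUNTIME)` is added here only so the annotation is visible to reflection in the demo (the annotations in the patch rely on the default `CLASS` retention).

```java
import java.lang.annotation.Documented;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

public class StabilityDemo {
    // Stable: source- and binary-compatible within a major release.
    @Documented
    @Retention(RetentionPolicy.RUNTIME) // demo-only: real ones default to CLASS
    public @interface Stable {}

    // Evolving: usable now, but the shape may shift between feature releases.
    @Documented
    @Retention(RetentionPolicy.RUNTIME)
    public @interface Evolving {}

    // An API marked Evolving: callers should be ready to rev their code.
    @Evolving
    public static class ExperimentalReader {
        public String format() { return "parquet"; }
    }

    public static void main(String[] args) {
        boolean evolving =
            ExperimentalReader.class.isAnnotationPresent(Evolving.class);
        System.out.println("ExperimentalReader evolving? " + evolving);
    }
}
```

Because the annotations are `@Documented`, the stability level also shows up in generated javadoc, which is the main way users would see it.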
[GitHub] spark pull request #15381: [SPARK-17707] [WEBUI] Web UI prevents spark-submi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15381
[GitHub] spark issue #15381: [SPARK-17707] [WEBUI] Web UI prevents spark-submit appli...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15381 LGTM. Merging to master and 2.0.
[GitHub] spark issue #15390: [SPARK-17806] [SQL] fix bug in join key rewritten in Has...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15390 **[Test build #66510 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66510/consoleFull)** for PR 15390 at commit [`974fadb`](https://github.com/apache/spark/commit/974fadb748488c1375d8ba93b6b4b3290b0e02e2).
[GitHub] spark issue #15218: [SPARK-17637][Scheduler]Packed scheduling for Spark task...
Github user zhzhan commented on the issue: https://github.com/apache/spark/pull/15218 @mridulm Thanks for the comments. Your concern regarding locality is right. The patch does not change this behavior, which gives priority to the locality preference. But if multiple executors satisfy the locality restriction, the policy is applied. In our production pipeline, we do see a big gain with respect to reserved CPU resources when dynamic allocation is enabled. @kayousterhout Would you like to take a look and provide your comments?
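The policy described above (honor locality first, then pack tasks onto as few executors as possible so lightly-used ones drain and can be released by dynamic allocation) can be sketched roughly as follows. The names and data structures here are hypothetical and are not taken from the patch:

```java
import java.util.Map;
import java.util.Set;

public class PackedAssign {
    // Among executors meeting the task's locality preference, choose the one
    // with the fewest free cores (best-fit packing). Executors that never get
    // picked stay idle and can be reclaimed under dynamic allocation.
    static String pick(Map<String, Integer> freeCores, Set<String> localityOk) {
        String best = null;
        for (Map.Entry<String, Integer> e : freeCores.entrySet()) {
            if (!localityOk.contains(e.getKey()) || e.getValue() <= 0) continue;
            if (best == null || e.getValue() < freeCores.get(best)) {
                best = e.getKey();
            }
        }
        return best; // null if no executor satisfies locality
    }

    public static void main(String[] args) {
        Map<String, Integer> free = Map.of("exec1", 4, "exec2", 1, "exec3", 2);
        // exec3 is excluded by locality; among the rest, exec2 is fullest.
        System.out.println(pick(free, Set.of("exec1", "exec2")));
    }
}
```

The trade-off the reviewers raise is real: packing can hurt data locality and spread of load, which is why locality filtering happens before the packing step in this sketch.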
[GitHub] spark issue #15390: [SPARK-17806] [SQL] fix bug in join key rewritten in Has...
Github user davies commented on the issue: https://github.com/apache/spark/pull/15390 cc @rxin
[GitHub] spark pull request #15390: [SPARK-17806] [SQL] fix bug in join key rewritten...
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/15390 [SPARK-17806] [SQL] fix bug in join key rewritten in HashJoin ## What changes were proposed in this pull request? In HashJoin, we try to rewrite the join key as a Long to improve the performance of finding a match. The rewriting part was not well tested and has a bug that could cause wrong results when the join key contains at least three integral columns and the total length of the key exceeds 8 bytes. ## How was this patch tested? Added unit tests covering the rewriting with different numbers of columns and different data types. Manually tested the reported case and confirmed that this PR fixes the bug. You can merge this pull request into a Git repository by running: $ git pull https://github.com/davies/spark rewrite_key Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15390.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15390 commit 974fadb748488c1375d8ba93b6b4b3290b0e02e2 Author: Davies Liu Date: 2016-10-07T17:20:07Z fix join key rewritten
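As a rough illustration of the technique being fixed (this is not Spark's actual HashJoin code; the class, method names, and column widths are hypothetical), rewriting a multi-column integral join key as a single long works by shifting each column into one 64-bit word, and is only valid while the combined widths fit in 8 bytes:

```java
public class JoinKeyPack {
    // Pack an int column (4 bytes) and a short column (2 bytes) into one
    // long: 6 bytes total, so the packing is lossless. With three or more
    // integral columns whose widths sum past 8 bytes, high bits would be
    // shifted out and distinct keys could collide -- the class of bug the
    // PR describes.
    static long pack(int a, short b) {
        return (((long) a) << 16) | (b & 0xFFFFL);
    }

    static int unpackA(long key) { return (int) (key >> 16); }
    static short unpackB(long key) { return (short) (key & 0xFFFF); }

    public static void main(String[] args) {
        long key = pack(123456, (short) -7);
        // Round-trips exactly because 4 + 2 bytes fit within the long.
        System.out.println(unpackA(key) + "," + unpackB(key));
    }
}
```

Masking `b` with `0xFFFFL` before the OR matters: a negative short would otherwise sign-extend to 64 bits and clobber the higher column's bits.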
[GitHub] spark pull request #15374: [SPARK-17800] Introduce InterfaceStability annota...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15374
[GitHub] spark issue #15365: [SPARK-17157][SPARKR]: Add multiclass logistic regressio...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15365 cc @sethah @yanboliang