[GitHub] spark pull request: [SPARK-2827][GraphX]Add degree distribution op...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1767#issuecomment-72234437 [Test build #26409 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26409/consoleFull) for PR 1767 at commit [`1c35298`](https://github.com/apache/spark/commit/1c35298bfd3bea5b8eeba6bb4804b3fe74ff7fd9). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2827][GraphX]Add degree distribution op...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1767#issuecomment-72234451 [Test build #26409 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26409/consoleFull) for PR 1767 at commit [`1c35298`](https://github.com/apache/spark/commit/1c35298bfd3bea5b8eeba6bb4804b3fe74ff7fd9). * This patch **fails RAT tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-2827][GraphX]Add degree distribution op...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1767#issuecomment-72234457 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26409/ Test FAILed.
[GitHub] spark pull request: [SPARK-5472][SQL] A JDBC data source for Spark...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4261#issuecomment-72230800 [Test build #26403 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26403/consoleFull) for PR 4261 at commit [`cf167ce`](https://github.com/apache/spark/commit/cf167cea9457e933b1b8ed5f0eff708e6535ef99). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `public class JDBCUtils ` * ` logWarning(sCouldn't find class $driver, e);` * ` implicit class JDBCDataFrame(rdd: DataFrame) `
[GitHub] spark pull request: [SPARK-5366][EC2] Check the mode of private ke...
Github user Liuchang0812 commented on the pull request: https://github.com/apache/spark/pull/4162#issuecomment-72235154 ubuntu@ip-172-31-24-113:~/spark/ec2$ ../dev/lint-python PEP 8 checks passed.
[GitHub] spark pull request: SPARK-5400 [MLlib] Changed name of GaussianMix...
GitHub user tgaloppo opened a pull request: https://github.com/apache/spark/pull/4290 SPARK-5400 [MLlib] Changed name of GaussianMixtureEM to GaussianMixture Decoupling the model and the algorithm You can merge this pull request into a Git repository by running: $ git pull https://github.com/tgaloppo/spark spark-5400 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4290.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4290 commit d8480761d98119a37b10a43a24bc6720e0e6eb87 Author: Travis Galoppo tjg2...@columbia.edu Date: 2015-01-30T15:20:55Z SPARK-5400 Changed name of GaussianMixtureEM to GaussianMixture to separate model from algorithm
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user zapletal-martin commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23841288 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- @@ -0,0 +1,238 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.regression + +import java.io.Serializable +import java.util.Arrays.binarySearch + +import org.apache.spark.api.java.{JavaDoubleRDD, JavaRDD} +import org.apache.spark.rdd.RDD + +/** + * Regression model for Isotonic regression + * + * @param features Array of features. + * @param labels Array of labels associated to the features at the same index. 
+ */ +class IsotonicRegressionModel ( +features: Array[Double], +val labels: Array[Double]) + extends Serializable { + + /** + * Predict labels for provided features + * Using a piecewise constant function + * + * @param testData features to be labeled + * @return predicted labels + */ + def predict(testData: RDD[Double]): RDD[Double] = +testData.map(predict) + + /** + * Predict labels for provided features + * Using a piecewise constant function + * + * @param testData features to be labeled + * @return predicted labels + */ + def predict(testData: JavaRDD[java.lang.Double]): JavaDoubleRDD = +JavaDoubleRDD.fromRDD(predict(testData.rdd.asInstanceOf[RDD[Double]])) + + /** + * Predict a single label + * Using a piecewise constant function + * + * @param testData feature to be labeled + * @return predicted label + */ + def predict(testData: Double): Double = { +val result = binarySearch(features, testData) + +val index = + if (result == -1) { --- End diff -- As for the special singularity case, I believe this requires further consideration. Currently we just sort the input to PAV by feature, so the order of multiple data points with the same feature is undefined. Consider a case where the features are 1, 2, 2, 3 and the labels are 1, 4, 2, 5 in the first case and 1, 2, 4, 5 in the second. In the first case the result of PAV would be 1, 3, 3, 5, but in the second case 1, 2, 4, 5. Similarly for an `IsotonicRegressionModel` with boundaries 1, 2, 2, 3 and predictions 1, 4, 2, 5 in the first case and 1, 2, 4, 5 in the second: the first model would return predict(1.5)=2.5 and predict(2.5)=3.5, but the second would return 1.5 and 4.5 respectively for the same input values. I suggest sorting the input by feature and then by label when features are equal. The same would apply to the model. Both PAV and the predictions for values between boundaries would then be deterministic.
The predictions for a boundary with multiple values would remain non-deterministic (they depend on `java.util.Arrays.binarySearch()`, which in this case returns one of the correct indices but does not specify which).
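The tie-ordering effect described in this comment can be reproduced with a small sketch. It assumes a plain unweighted pool-adjacent-violators pass, which is not necessarily the implementation in the pull request; `IsotonicTieSketch`, `pav`, and `sortInput` are hypothetical names used only for illustration.

```scala
object IsotonicTieSketch {
  /** Unweighted pool-adjacent-violators: returns the non-decreasing fit of `labels`. */
  def pav(labels: Seq[Double]): Seq[Double] = {
    // Each block is (sum of labels, count); the most recent block is at the head.
    var blocks = List.empty[(Double, Int)]
    for (y <- labels) {
      var cur = (y, 1)
      // Pool with the previous block while its mean exceeds the current block's mean.
      while (blocks.nonEmpty && blocks.head._1 / blocks.head._2 > cur._1 / cur._2) {
        cur = (blocks.head._1 + cur._1, blocks.head._2 + cur._2)
        blocks = blocks.tail
      }
      blocks = cur :: blocks
    }
    // Expand every block back into one fitted value per input point.
    blocks.reverse.flatMap { case (sum, n) => Seq.fill(n)(sum / n) }
  }

  /** Sort (feature, label) pairs by feature, breaking ties by label. */
  def sortInput(points: Seq[(Double, Double)]): Seq[(Double, Double)] =
    points.sortWith { (a, b) => a._1 < b._1 || (a._1 == b._1 && a._2 < b._2) }
}
```

With the features 1, 2, 2, 3 from the comment, the label order 1, 4, 2, 5 pools the two middle points while 1, 2, 4, 5 leaves them untouched; sorting by feature and then by label always produces the second ordering.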
[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3798#issuecomment-72228772 [Test build #26401 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26401/consoleFull) for PR 3798 at commit [`0090553`](https://github.com/apache/spark/commit/0090553eba09240b6ad4cf508ea33503705b12d9). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` class DeterministicKafkaInputDStreamCheckpointData extends DStreamCheckpointData(this) ` * `class KafkaCluster(val kafkaParams: Map[String, String]) extends Serializable ` * ` case class LeaderOffset(host: String, port: Int, offset: Long)` * `class KafkaRDDPartition(`
[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...
Github user avulanov commented on the pull request: https://github.com/apache/spark/pull/1484#issuecomment-72231302 @mengxr I'll do the updates today
[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3798#issuecomment-72228780 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26401/ Test PASSed.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
GitHub user tgravescs opened a pull request: https://github.com/apache/spark/pull/4292 [SPARK-3778] newAPIHadoopRDD doesn't properly pass credentials for secure hdfs This was previously https://github.com/apache/spark/pull/2676. See https://issues.apache.org/jira/browse/SPARK-3778. This affects anyone trying to access secure HDFS with something like: val lines = { val hconf = new Configuration() hconf.set("mapred.input.dir", mydir) hconf.set("textinputformat.record.delimiter", "\003432\n") sc.newAPIHadoopRDD(hconf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]) } You can merge this pull request into a Git repository by running: $ git pull https://github.com/tgravescs/spark SPARK-3788 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4292.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4292 commit cf3b45337a1fb1da6492779709b2bf213bccbb16 Author: Thomas Graves tgra...@apache.org Date: 2014-10-06T14:53:29Z newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn
[GitHub] spark pull request: [SPARK-5472][SQL] A JDBC data source for Spark...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4261#issuecomment-72230810 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26403/ Test PASSed.
[GitHub] spark pull request: [SPARK-5388] Provide a stable application subm...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4216#discussion_r23867076 --- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala --- @@ -121,6 +122,17 @@ private[spark] class Master( throw new SparkException("spark.deploy.defaultCores must be positive") } + // Alternative application submission gateway that is stable across Spark versions + private val restServerEnabled = conf.getBoolean("spark.master.rest.enabled", true) + private val restServer = +if (restServerEnabled) { + val port = conf.getInt("spark.master.rest.port", 17077) --- End diff -- I made this 6066. Let me know if you have any objections.
[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/3637#discussion_r23879189 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/DeveloperApiExample.scala --- @@ -0,0 +1,181 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.examples.ml + +import org.apache.spark.{SparkConf, SparkContext} +import org.apache.spark.ml.classification.{Classifier, ClassifierParams, ClassificationModel} +import org.apache.spark.ml.param.{Params, IntParam, ParamMap} +import org.apache.spark.mllib.linalg.{BLAS, Vector, Vectors} +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.sql.{DataFrame, Row, SQLContext} + + +/** + * A simple example demonstrating how to write your own learning algorithm using Estimator, + * Transformer, and other abstractions. + * This mimics [[org.apache.spark.ml.classification.LogisticRegression]]. 
+ * Run with + * {{{ + * bin/run-example ml.DeveloperApiExample + * }}} + */ +object DeveloperApiExample { + + def main(args: Array[String]) { +val conf = new SparkConf().setAppName(DeveloperApiExample) +val sc = new SparkContext(conf) +val sqlContext = new SQLContext(sc) +import sqlContext._ + +// Prepare training data. +val training = sparkContext.parallelize(Seq( + LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)), + LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)), + LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)), + LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5 + +// Create a LogisticRegression instance. This instance is an Estimator. +val lr = new MyLogisticRegression() +// Print out the parameters, documentation, and any default values. +println(MyLogisticRegression parameters:\n + lr.explainParams() + \n) + +// We may set parameters using setter methods. +lr.setMaxIter(10) + +// Learn a LogisticRegression model. This uses the parameters stored in lr. +val model = lr.fit(training) + +// Prepare test data. +val test = sparkContext.parallelize(Seq( + LabeledPoint(1.0, Vectors.dense(-1.0, 1.5, 1.3)), + LabeledPoint(0.0, Vectors.dense(3.0, 2.0, -0.1)), + LabeledPoint(1.0, Vectors.dense(0.0, 2.2, -1.5 + +// Make predictions on test data. +val sumPredictions: Double = model.transform(test) + .select(features, label, prediction) + .collect() + .map { case Row(features: Vector, label: Double, prediction: Double) = +prediction + }.sum +assert(sumPredictions == 0.0, + MyLogisticRegression predicted something other than 0, even though all weights are 0!) + +sc.stop() + } +} + +/** + * Example of defining a parameter trait for a user-defined type of [[Classifier]]. + * + * NOTE: This is private since it is an example. In practice, you may not want it to be private. 
+ */ +private trait MyLogisticRegressionParams extends ClassifierParams { + + /** + * Param for max number of iterations + * + * NOTE: The usual way to add a parameter to a model or algorithm is to include: + * - val myParamName: ParamType + * - def getMyParamName + * - def setMyParamName --- End diff -- Is the setter missing in this example, or is it auto-generated somehow?
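For readers unfamiliar with the three-part convention listed in the doc comment (a `val`, a getter, and a setter), a minimal self-contained sketch follows. It does not use the spark.ml `Param` API and says nothing about where the pull request actually defines its setter; `IntParamSketch`, `MaxIterParamsSketch`, `MyEstimatorSketch`, and the default of 100 are all assumptions made for illustration.

```scala
// Hypothetical stand-in for a parameter descriptor; not the spark.ml Param class.
final class IntParamSketch(val name: String, val doc: String, val default: Int)

trait MaxIterParamsSketch {
  // val myParamName: ParamType
  val maxIter: IntParamSketch =
    new IntParamSketch("maxIter", "max number of iterations", 100)

  // Current value, starting from the descriptor's default.
  private var maxIterValue: Int = maxIter.default

  // def getMyParamName
  def getMaxIter: Int = maxIterValue

  // def setMyParamName; returning this.type lets calls be chained on the concrete class.
  def setMaxIter(value: Int): this.type = {
    maxIterValue = value
    this
  }
}

// A concrete class picks up the val, the getter, and the setter from the trait.
class MyEstimatorSketch extends MaxIterParamsSketch
```

In this sketch nothing is generated automatically: if the setter were left out of the trait, a call such as `setMaxIter(10)` would not compile.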
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3519#issuecomment-72252929 [Test build #26417 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26417/consoleFull) for PR 3519 at commit [`3da56e5`](https://github.com/apache/spark/commit/3da56e530276a2ff7104993da893fe04e124392d). * This patch merges cleanly.
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3519#issuecomment-72284082 [Test build #26434 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26434/consoleFull) for PR 3519 at commit [`e3c0e44`](https://github.com/apache/spark/commit/e3c0e442ab591731c322ec9cc78530a7665a00b9). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5498][SPARK-SQL]fix bug when query the ...
GitHub user jeanlyn opened a pull request: https://github.com/apache/spark/pull/4289 [SPARK-5498][SPARK-SQL]fix bug when query the data when partition schema does not match table schema

In Hive, the schema of a partition may differ from the table schema. When we use spark-sql to query a partition whose schema differs from the table schema, we get the exception described in the [jira](https://issues.apache.org/jira/browse/SPARK-5498). For example:

1. Look at the schema of the partition and of the table:

```sql
DESCRIBE partition_test PARTITION (dt='1');
id      int     None
name    string  None
dt      string  None

# Partition Information
# col_name      data_type       comment
dt      string  None
```

```
DESCRIBE partition_test;
OK
id      bigint  None
name    string  None
dt      string  None

# Partition Information
# col_name      data_type       comment
dt      string  None
```

2. Run the query:

```sql
SELECT * FROM partition_test where dt='1';
```

We get the cast exception `java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt`

You can merge this pull request into a Git repository by running: $ git pull https://github.com/jeanlyn/spark schema Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4289.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4289 commit adfc7defb278667d0c27c6128b00339bb8d52bb1 Author: jeanlyn jeanly...@gmail.com Date: 2015-01-30T13:48:21Z SPARK-5498:fix bug when query the data when partition schema does not match table schema
[GitHub] spark pull request: [SPARK-4001][MLlib] adding parallel FP-Growth ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2847#issuecomment-72232842 [Test build #26407 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26407/consoleFull) for PR 2847 at commit [`ec21f7d`](https://github.com/apache/spark/commit/ec21f7dfcad6191e0c2d6d7fd93ac77012098e6c). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3976#discussion_r23858269 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala --- @@ -267,10 +277,22 @@ object SparkSubmit { // In yarn-cluster mode, use yarn.Client as a wrapper around the user class if (isYarnCluster) { childMainClass = "org.apache.spark.deploy.yarn.Client" - if (args.primaryResource != SPARK_INTERNAL) { -childArgs += ("--jar", args.primaryResource) + if (args.isPython) {// yarn-cluster mode for python application + val primaryResourceLocalPath = new Path(args.primaryResource) +childArgs += ("--primaryResource", primaryResourceLocalPath.getName) +val pyFilesLocalNames:String = if (args.pyFiles != null) { + args.pyFiles.split(",").map { p => (new Path(p)).getName }.mkString(",") --- End diff -- Also, it seems like if the primary resource is a jar, it isn't truncated with getName. Is there a reason this needs to be different for a python file?
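The truncation being questioned here reduces each entry of a comma-separated path list to its final path component. A rough standalone sketch of that transformation is shown below; it uses `java.nio.file.Paths` in place of the Hadoop `Path` class from the diff (an assumption made so the example runs without Hadoop), and `PyFilesSketch` and `localNames` are hypothetical names.

```scala
import java.nio.file.Paths

object PyFilesSketch {
  /** Reduce each entry of a comma-separated path list to its file name. */
  def localNames(pyFiles: String): String =
    if (pyFiles == null || pyFiles.isEmpty) ""
    else {
      pyFiles
        .split(",")
        // Keep only the last path component, dropping any directory prefix.
        .map(p => Paths.get(p).getFileName.toString)
        .mkString(",")
    }
}
```

Unlike the Hadoop class, this version does not interpret URI schemes such as `hdfs://`, so it only approximates the behaviour under review.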
[GitHub] spark pull request: [SPARK-5341] Use maven coordinates as dependen...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4215#discussion_r23878978 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala --- @@ -431,6 +458,155 @@ object SparkSubmit { } } +/** Provides utility functions to be used inside SparkSubmit. */ +private[spark] object SparkSubmitUtils extends Logging { + + // Directories for caching downloads through ivy and storing the jars when maven coordinates are + // supplied to spark-submit + private var PACKAGES_DIRECTORY: File = null + + /** + * Represents a Maven Coordinate + * @param groupId the groupId of the coordinate + * @param artifactId the artifactId of the coordinate + * @param version the version of the coordinate + */ + private[spark] case class MavenCoordinate(groupId: String, artifactId: String, version: String) + + /** + * Resolves any dependencies that were supplied through maven coordinates + * @param coordinates Comma-delimited string of maven coordinates + * @param remoteRepos Comma-delimited string of remote repositories other than maven central + * @param ivyPath The path to the local ivy repository + * @return The comma-delimited path to the jars of the given maven artifacts including their + * transitive dependencies + */ + private[spark] def resolveMavenCoordinates( + coordinates: String, + remoteRepos: String, + ivyPath: String, + isTest: Boolean = false): String = { --- End diff -- Also, what do you think about returning a Seq of paths and leaving it up to the caller to join them with commas?
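The suggestion can be illustrated with a small sketch in which the resolver returns a `Seq[String]` and the caller decides how to join it. `ResolveSketch`, `resolve`, and the jar path layout are hypothetical; no Ivy resolution is performed and this is not the signature from the pull request.

```scala
object ResolveSketch {
  /**
   * Hypothetical resolver: maps "group:artifact:version" coordinates to jar
   * paths under `ivyPath`. The path layout is invented for this example.
   */
  def resolve(coordinates: Seq[String], ivyPath: String): Seq[String] =
    coordinates.map { c =>
      val Array(group, artifact, version) = c.split(":")
      s"$ivyPath/jars/${group}_$artifact-$version.jar"
    }
}
```

A caller that needs the comma-delimited form can write `ResolveSketch.resolve(coords, path).mkString(",")`, while other callers can work with the individual paths without re-splitting a string.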
[GitHub] spark pull request: [SPARK-5498][SPARK-SQL]fix bug when query the ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4289#issuecomment-72206551 Can one of the admins verify this patch?
[GitHub] spark pull request: [WIP][SPARK-5501][SQL] Write support for the d...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4294#issuecomment-72268532 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26426/ Test FAILed.
[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-72248433 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26414/ Test FAILed.
[GitHub] spark pull request: [SPARK-5388] Provide a stable application subm...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4216#discussion_r23876918 --- Diff: core/src/main/scala/org/apache/spark/deploy/rest/SubmitRestProtocolMessage.scala --- @@ -0,0 +1,209 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.deploy.rest + +import com.fasterxml.jackson.annotation._ +import com.fasterxml.jackson.annotation.JsonAutoDetect.Visibility +import com.fasterxml.jackson.annotation.JsonInclude.Include +import com.fasterxml.jackson.databind.ObjectMapper +import org.json4s.JsonAST._ +import org.json4s.jackson.JsonMethods._ + +import org.apache.spark.util.Utils + +/** + * An abstract message exchanged in the REST application submission protocol. + * + * This message is intended to be serialized to and deserialized from JSON in the exchange. 
+ * Each message can either be a request or a response and consists of three common fields: + * (1) the action, which fully specifies the type of the message + * (2) the Spark version of the client / server + * (3) an optional message + */ +@JsonInclude(Include.NON_NULL) +@JsonAutoDetect(getterVisibility = Visibility.ANY, setterVisibility = Visibility.ANY) +@JsonPropertyOrder(alphabetic = true) +abstract class SubmitRestProtocolMessage { + private val messageType = Utils.getFormattedClassName(this) + protected val action: String = messageType + protected val sparkVersion: SubmitRestProtocolField[String] + protected val message = new SubmitRestProtocolField[String](message) + + // Required for JSON de/serialization and not explicitly used + private def getAction: String = action + private def setAction(s: String): this.type = this + + // Intended for the user and not for JSON de/serialization, which expects more specific keys + @JsonIgnore + def getSparkVersion: String + @JsonIgnore + def setSparkVersion(s: String): this.type + + def getMessage: String = message.toString + def setMessage(s: String): this.type = setField(message, s) + + /** + * Serialize the message to JSON. + * This also ensures that the message is valid and its fields are in the expected format. + */ + def toJson: String = { +validate() +val mapper = new ObjectMapper +pretty(parse(mapper.writeValueAsString(this))) --- End diff -- great --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3381] [MLlib] Eliminate bins for unorde...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4231#issuecomment-72265786 @MechCoder Sorry, there are a lot of PRs out there now, so this may not get merged before the code freeze. It's a good cleanup, though, so I'll definitely take a look when I can. Thanks for your patience.
[GitHub] spark pull request: SPARK-5425: Use synchronised methods in system...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/4220#issuecomment-72268511 @jacek-lewandowski I think Sean meant that you can do `new Properties().putAll(oldProperties)` instead of cloning.
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3976#discussion_r23861868 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala --- @@ -172,7 +172,8 @@ private[spark] class SparkSubmitArguments(args: Seq[String], env: Map[String, St } // Require all python files to be local, so we can add them to the PYTHONPATH -if (isPython) { +// when yarn-cluster, all python files can be non-local +if (isPython && !master.equalsIgnoreCase("yarn-cluster")) { --- End diff -- This is not sufficient. Users can manually specify `--master yarn --deploy-mode cluster`.
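The review point above can be sketched as a small helper that treats both spellings of YARN cluster mode as equivalent. This is only a minimal sketch under stated assumptions, not the code from the PR: `isYarnCluster` and `requiresLocalPythonFiles` are hypothetical names, and `master` and `deployMode` are assumed to be plain strings that may be null.

```scala
object YarnClusterCheck {
  /** True when YARN cluster mode was requested, either as `--master yarn-cluster`
    * or as `--master yarn --deploy-mode cluster`. (Hypothetical helper.) */
  def isYarnCluster(master: String, deployMode: String): Boolean = {
    val m = Option(master).map(_.trim.toLowerCase).getOrElse("")
    val d = Option(deployMode).map(_.trim.toLowerCase).getOrElse("")
    m == "yarn-cluster" || (m == "yarn" && d == "cluster")
  }

  /** Python files must be local unless the application runs in YARN cluster mode. */
  def requiresLocalPythonFiles(isPython: Boolean, master: String, deployMode: String): Boolean =
    isPython && !isYarnCluster(master, deployMode)

  def main(args: Array[String]): Unit = {
    println(requiresLocalPythonFiles(isPython = true, "yarn-cluster", null))
    println(requiresLocalPythonFiles(isPython = true, "yarn", "cluster"))
    println(requiresLocalPythonFiles(isPython = true, "yarn", "client"))
  }
}
```

The sketch does not try to reject contradictory input such as `yarn-cluster` combined with `--deploy-mode client`; that validation would belong elsewhere.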
[GitHub] spark pull request: [WIP][SPARK-5501][SQL] Write support for the d...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4294#issuecomment-72268528 [Test build #26426 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26426/consoleFull) for PR 4294 at commit [`a2f9c06`](https://github.com/apache/spark/commit/a2f9c0695ecd1c5a0ae334bde21740588ab81c29). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait TableScan extends BaseRelation ` * `trait PrunedScan extends BaseRelation ` * `trait PrunedFilteredScan extends BaseRelation ` * `trait CatalystScan extends BaseRelation ` * `trait InsertableRelation extends BaseRelation `
[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-72248428 [Test build #26414 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26414/consoleFull) for PR 4047 at commit [`3e0c894`](https://github.com/apache/spark/commit/3e0c8945640523f0747e2145ce5ca7b1d405b4ab). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class Tokenizer(sc: SparkContext, stopwordFile: String) extends Serializable `
[GitHub] spark pull request: [SPARK-5341] Use maven coordinates as dependen...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4215#discussion_r23874490 --- Diff: bin/windows-utils.cmd --- @@ -32,7 +32,7 @@ SET opts=\--master\ \--deploy-mode\ \--class\ \--name\ \--jars\ \--p SET opts=%opts:~1,-1% \--conf\ \--properties-file\ \--driver-memory\ \--driver-java-options\ SET opts=%opts:~1,-1% \--driver-library-path\ \--driver-class-path\ \--executor-memory\ SET opts=%opts:~1,-1% \--driver-cores\ \--total-executor-cores\ \--executor-cores\ \--queue\ -SET opts=%opts:~1,-1% \--num-executors\ \--archives\ +SET opts=%opts:~1,-1% \--num-executors\ \--archives\ \--packages\ \--repositories\ --- End diff -- Looks like this line is now missing a closing quote? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-5425: Use synchronised methods in system...
Github user jacek-lewandowski commented on the pull request: https://github.com/apache/spark/pull/4220#issuecomment-72274441

It serializes the object and then deserializes it, so I suppose this is a deep copy. As for `stringPropertyNames`: you can, but this will not be a 1:1 copy:

```scala
val parent = new Properties()
parent.setProperty("test1", "A")
val child = new Properties(parent)
child.put("test1", "C")
child.put("test2", "B")
child.getProperty("test1")
child.remove("test1")
child.getProperty("test1")
```

will give you

```
scala> res17: Object = C
scala> res18: String = A
```

When you copy in the way you suggested, there will be `null` after removal.
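The difference described above can be reproduced with a self-contained sketch. It compares a `Properties` object that keeps its defaults chain with a flattened copy built from `stringPropertyNames`; `flatCopy` and `afterRemoval` are hypothetical helper names introduced only for this illustration.

```scala
import java.util.Properties

object PropertiesCopyDemo {
  /** Copies every visible key (own entries and defaults) into a fresh Properties
    * object that has no defaults chain of its own. */
  def flatCopy(src: Properties): Properties = {
    val dst = new Properties()
    val names = src.stringPropertyNames().iterator()
    while (names.hasNext) {
      val key = names.next()
      dst.setProperty(key, src.getProperty(key))
    }
    dst
  }

  /** Returns the value of "test1" after removal, first for the original object
    * and then for its flattened copy. */
  def afterRemoval(): (String, String) = {
    val parent = new Properties()
    parent.setProperty("test1", "A")
    val child = new Properties(parent) // parent acts as the defaults
    child.setProperty("test1", "C")
    child.setProperty("test2", "B")

    val copy = flatCopy(child)
    child.remove("test1") // the default from parent becomes visible again
    copy.remove("test1")  // nothing left to fall back to
    (child.getProperty("test1"), copy.getProperty("test1"))
  }

  def main(args: Array[String]): Unit = {
    val (original, flattened) = afterRemoval()
    println(s"original: $original, flattened copy: $flattened")
  }
}
```

Whether the loss of the defaults chain matters depends on whether callers ever remove keys from the copy.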
[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...
Github user avulanov commented on a diff in the pull request: https://github.com/apache/spark/pull/1484#discussion_r23870580 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -0,0 +1,86 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.feature + +import org.apache.spark.annotation.Experimental +import org.apache.spark.mllib.linalg +import org.apache.spark.mllib.linalg.{Vectors, Vector} +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.stat.Statistics +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * Chi Squared selector model. + * + * @param indices list of indices to select (filter) + */ +@Experimental +class ChiSqSelectorModel(indices: IndexedSeq[Int]) extends VectorTransformer { + /** + * Applies transformation on a vector. + * + * @param vector vector to be transformed. + * @return transformed vector. + */ + override def transform(vector: linalg.Vector): linalg.Vector = { +Compress(vector, indices) --- End diff -- I thought it would be useful in general for filtering features. Does it make sense?
[GitHub] spark pull request: [SPARK-3975] Added support for BlockMatrix add...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4274#discussion_r23875903 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala --- @@ -246,4 +248,86 @@ class BlockMatrix( val localMat = toLocalMatrix() new BDM[Double](localMat.numRows, localMat.numCols, localMat.toArray) } + + /** Adds two block matrices together. The matrices must have the same size and matching +* `rowsPerBlock` and `colsPerBlock` values. If one of the blocks that are being added are +* instances of [[SparseMatrix]], the resulting sub matrix will also be a [[SparseMatrix]], even +* if it is being added to a [[DenseMatrix]]. If two dense matrices are added, the output will +* also be a [[DenseMatrix]]. +*/ + def add(other: BlockMatrix): BlockMatrix = { +require(numRows() == other.numRows(), "Both matrices must have the same number of rows. " + + s"A.numRows: ${numRows()}, B.numRows: ${other.numRows()}") +require(numCols() == other.numCols(), "Both matrices must have the same number of columns. " + + s"A.numCols: ${numCols()}, B.numCols: ${other.numCols()}") +if (rowsPerBlock == other.rowsPerBlock && colsPerBlock == other.colsPerBlock) { + val addedBlocks = blocks.cogroup(other.blocks, createPartitioner()) +.map { case ((blockRowIndex, blockColIndex), (a, b)) => + if (a.size > 1 || b.size > 1) { +throw new SparkException("There are MatrixBlocks with duplicate indices. Please " + --- End diff -- Put `blockRowIndex` and `blockColIndex` in the message.
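One way to follow the suggestion above is to interpolate the block coordinates into the error text. The sketch below is an assumption-laden illustration, not the PR's code: `singleBlock` is a hypothetical helper, the message wording is invented, and `IllegalStateException` stands in for `SparkException` so that the example has no Spark dependency.

```scala
object DuplicateBlockCheck {
  /** Returns the only block at the given position, or fails with a message that
    * names the offending (blockRowIndex, blockColIndex) pair. */
  def singleBlock[T](blockRowIndex: Int, blockColIndex: Int, blocks: Iterable[T]): Option[T] = {
    if (blocks.size > 1) {
      // IllegalStateException is a stand-in for SparkException in this sketch.
      throw new IllegalStateException(
        s"There are multiple MatrixBlocks with indices: ($blockRowIndex, $blockColIndex). " +
          "Please remove the duplicates.")
    }
    blocks.headOption
  }

  def main(args: Array[String]): Unit = {
    println(singleBlock(0, 1, Seq("block")))
    println(scala.util.Try(singleBlock(2, 3, Seq("a", "b"))))
  }
}
```

Naming the indices lets a user locate the duplicated block directly instead of scanning the whole input.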
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3976#discussion_r23862088 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterArguments.scala --- @@ -48,6 +50,14 @@ class ApplicationMasterArguments(val args: Array[String]) { userClass = value args = tail +case ("--primaryResource") :: value :: tail => --- End diff -- This should be `--primary-resource` instead of camel case, for consistency.
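The list-based pattern matching used in the diff above, with the hyphenated flag name the reviewer asks for, could look roughly like the following. This is a standalone sketch: `Parsed` and `parse` are hypothetical names, and only two flags are handled for illustration.

```scala
object ArgParsingSketch {
  /** Values collected from the command line (illustrative subset only). */
  final case class Parsed(
      userClass: Option[String] = None,
      primaryResource: Option[String] = None)

  /** Consumes "--flag value" pairs from the head of the list until it is empty. */
  @scala.annotation.tailrec
  def parse(args: List[String], acc: Parsed = Parsed()): Parsed = args match {
    case "--class" :: value :: tail =>
      parse(tail, acc.copy(userClass = Some(value)))
    case "--primary-resource" :: value :: tail =>
      parse(tail, acc.copy(primaryResource = Some(value)))
    case Nil =>
      acc
    case unknown :: _ =>
      throw new IllegalArgumentException(s"Unknown or incomplete argument: $unknown")
  }

  def main(args: Array[String]): Unit =
    println(parse(List("--class", "Foo", "--primary-resource", "app.py")))
}
```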
[GitHub] spark pull request: remove redundant field childOutput from exec...
GitHub user kai-zeng opened a pull request: https://github.com/apache/spark/pull/4291 remove redundant field childOutput from execution.Aggregate, use child.output instead You can merge this pull request into a Git repository by running: $ git pull https://github.com/kai-zeng/spark aggregate-fix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4291.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4291 commit 78658efffb3fe0632a5aafe45997c2ab24791475 Author: kai kaiz...@eecs.berkeley.edu Date: 2015-01-30T16:10:19Z remove redundant field childOutput
[GitHub] spark pull request: [SPARK-3975] Added support for BlockMatrix add...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4274#issuecomment-72277417 [Test build #26425 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26425/consoleFull) for PR 4274 at commit [`ac25783`](https://github.com/apache/spark/commit/ac25783cb125e1eea4728d0933f1295d43d0c442). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5388] Provide a stable application subm...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4216#issuecomment-72255456 [Test build #26413 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26413/consoleFull) for PR 4216 at commit [`bf696ff`](https://github.com/apache/spark/commit/bf696ff0b7135883e53e5fb275b4afa0db6c4a4a). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3519#issuecomment-72261726 [Test build #26416 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26416/consoleFull) for PR 3519 at commit [`75eac55`](https://github.com/apache/spark/commit/75eac55d2a168ca3452f08b403187c503cdbb45a). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3976#discussion_r23862118 --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala --- @@ -103,11 +104,15 @@ private[spark] class ClientArguments(args: Array[String], sparkConf: SparkConf) userClass = value args = tail +case ("--primaryResource") :: value :: tail => --- End diff -- I agree, since we don't ever set this to the main jar if we're not running python
[GitHub] spark pull request: [SPARK-5176] The thrift server does not suppor...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/4137#issuecomment-72280943 @andrewor14 Per our offline discussion, it still requires some minor work to make the Thrift server support standalone cluster mode (mainly related to the `spark-internal` argument). At least for now, we don't want to add it in 1.3.0. So this PR LGTM. @tpanningnextcen Thanks for working on this!
[GitHub] spark pull request: [SPARK-5388] Provide a stable application subm...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4216#issuecomment-72255468 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26413/ Test FAILed.
[GitHub] spark pull request: [SPARK-5341] Use maven coordinates as dependen...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4215#discussion_r23878483 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala --- @@ -431,6 +458,155 @@ object SparkSubmit { } } +/** Provides utility functions to be used inside SparkSubmit. */ +private[spark] object SparkSubmitUtils extends Logging { + + // Directories for caching downloads through ivy and storing the jars when maven coordinates are + // supplied to spark-submit + private var PACKAGES_DIRECTORY: File = null + + /** + * Represents a Maven Coordinate + * @param groupId the groupId of the coordinate + * @param artifactId the artifactId of the coordinate + * @param version the version of the coordinate + */ + private[spark] case class MavenCoordinate(groupId: String, artifactId: String, version: String) + + /** + * Resolves any dependencies that were supplied through maven coordinates + * @param coordinates Comma-delimited string of maven coordinates + * @param remoteRepos Comma-delimited string of remote repositories other than maven central + * @param ivyPath The path to the local ivy repository + * @return The comma-delimited path to the jars of the given maven artifacts including their + * transitive dependencies + */ + private[spark] def resolveMavenCoordinates( + coordinates: String, + remoteRepos: String, + ivyPath: String, + isTest: Boolean = false): String = { --- End diff -- Similarly, maybe the configuration of the ChainResolver could be done in its own helper method that takes the comma-separated list of remoteRepos and returns a resolver. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
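The refactoring suggested in the review comment above, moving the handling of the comma-separated `remoteRepos` string into its own helper, might be shaped like this. The sketch deliberately avoids the Ivy API: `RemoteRepo` and `parseRemoteRepos` are hypothetical names, the `repo-N` naming scheme is invented, and a real helper would build and return an Ivy resolver instead of a plain sequence.

```scala
object RepoListSketch {
  /** Plain stand-in for a configured resolver entry (not an Ivy type). */
  final case class RemoteRepo(name: String, root: String)

  /** Splits a comma-delimited repository string, tolerating null, blank input,
    * and stray whitespace, and assigns each repository a sequential name. */
  def parseRemoteRepos(remoteRepos: String): Seq[RemoteRepo] =
    Option(remoteRepos)
      .map(_.split(",").toSeq)
      .getOrElse(Seq.empty)
      .map(_.trim)
      .filter(_.nonEmpty)
      .zipWithIndex
      .map { case (root, i) => RemoteRepo(s"repo-${i + 1}", root) }

  def main(args: Array[String]): Unit =
    parseRemoteRepos("https://repo.example.org/a, https://repo.example.org/b").foreach(println)
}
```

Keeping this logic in one function makes it testable without running a full dependency resolution.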
[GitHub] spark pull request: SPARK-5500. Document that feeding hadoopFile i...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/4293#issuecomment-72285133 This is good to add. One idea that just came to my mind is ... why don't the downstream operators inspect whether they need to make copies or not?
[GitHub] spark pull request: [SPARK-5366][EC2] Check the mode of private ke...
Github user nchammas commented on the pull request: https://github.com/apache/spark/pull/4162#issuecomment-72223944 OK, LGTM pending Python style tests.
[GitHub] spark pull request: Don't return `ERROR 500` when have missing arg...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/4239#issuecomment-72286357 retest this please
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4254#issuecomment-72279293 LGTM except for minor user guide issues, which will be addressed in SPARK-5503. I've merged this into master. Thanks for contributing! (Now MLlib depends on GraphX.)
[GitHub] spark pull request: [SPARK-5341] Use maven coordinates as dependen...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4215#discussion_r23878900 --- Diff: core/src/test/scala/org/apache/spark/deploy/SparkSubmitUtilsSuite.scala --- @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.deploy + +import org.apache.spark.util.ResetSystemProperties +import org.scalatest.{Matchers, FunSuite} + +class SparkSubmitUtilsSuite extends FunSuite with Matchers with ResetSystemProperties { + + def beforeAll() { +System.setProperty(spark.testing, true) --- End diff -- If I recall, Maven already sets this property to `true` before running the tests. Is there a reason that we need this (and ResetSystemProperties) here, or is it a carry-over from another test suite? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4254
[GitHub] spark pull request: [SPARK-5341] Use maven coordinates as dependen...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4215#discussion_r23879324 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala --- @@ -431,6 +458,155 @@ object SparkSubmit { } } +/** Provides utility functions to be used inside SparkSubmit. */ +private[spark] object SparkSubmitUtils extends Logging { + + // Directories for caching downloads through ivy and storing the jars when maven coordinates are + // supplied to spark-submit + private var PACKAGES_DIRECTORY: File = null + + /** + * Represents a Maven Coordinate + * @param groupId the groupId of the coordinate + * @param artifactId the artifactId of the coordinate + * @param version the version of the coordinate + */ + private[spark] case class MavenCoordinate(groupId: String, artifactId: String, version: String) + + /** + * Resolves any dependencies that were supplied through maven coordinates + * @param coordinates Comma-delimited string of maven coordinates + * @param remoteRepos Comma-delimited string of remote repositories other than maven central + * @param ivyPath The path to the local ivy repository + * @return The comma-delimited path to the jars of the given maven artifacts including their + * transitive dependencies + */ + private[spark] def resolveMavenCoordinates( + coordinates: String, + remoteRepos: String, + ivyPath: String, + isTest: Boolean = false): String = { +if (coordinates == null || coordinates.trim.isEmpty) { + +} else { + val artifacts = coordinates.split(,).map { p = +val splits = p.split(:) +require(splits.length == 3, sProvided Maven Coordinates must be in the form + + s'groupId:artifactId:version'. The coordinate provided is: $p) +require(splits(0) != null splits(0).trim.nonEmpty, sThe groupId cannot be null or + + sbe whitespace. The groupId provided is: ${splits(0)}) +require(splits(1) != null splits(1).trim.nonEmpty, sThe artifactId cannot be null or + + sbe whitespace. 
The artifactId provided is: ${splits(1)}) +require(splits(2) != null splits(2).trim.nonEmpty, sThe version cannot be null or + + sbe whitespace. The version provided is: ${splits(2)}) +new MavenCoordinate(splits(0), splits(1), splits(2)) + } + // Default configuration name for ivy + val conf = default + // set ivy settings for location of cache + val ivySettings: IvySettings = new IvySettings + if (ivyPath == null || ivyPath.trim.isEmpty) { --- End diff -- Is `ivyPath` acting like an optional value here, since it can be null? If that's the case, it might be nice to use an `Option` to make its optional nature more explicit. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
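The two ideas in the comment above, validating `groupId:artifactId:version` strings and making the optional nature of `ivyPath` explicit with `Option`, can be sketched without any Ivy dependency. `MavenCoordinate` mirrors the case class in the diff; `parseCoordinate`, `normalizeIvyPath`, `ivyCacheDir` and the `defaultDir` parameter are hypothetical names used only for this illustration.

```scala
object IvyPathSketch {
  final case class MavenCoordinate(groupId: String, artifactId: String, version: String)

  /** Parses one coordinate and rejects anything that lacks three non-blank parts. */
  def parseCoordinate(p: String): MavenCoordinate = {
    val splits = p.split(":").map(_.trim)
    require(splits.length == 3 && splits.forall(_.nonEmpty),
      "Provided Maven Coordinates must be in the form 'groupId:artifactId:version'. " +
        s"The coordinate provided is: $p")
    MavenCoordinate(splits(0), splits(1), splits(2))
  }

  /** Turns a possibly null or blank path into an explicit Option. */
  def normalizeIvyPath(ivyPath: String): Option[String] =
    Option(ivyPath).map(_.trim).filter(_.nonEmpty)

  /** Picks the configured path when present, else a caller-supplied default. */
  def ivyCacheDir(ivyPath: Option[String], defaultDir: String): String =
    ivyPath.getOrElse(defaultDir)

  def main(args: Array[String]): Unit = {
    println(parseCoordinate("org.example:demo:1.0"))
    println(ivyCacheDir(normalizeIvyPath(null), "/tmp/ivy-default"))
  }
}
```

With an `Option`, the `null`-or-blank check happens once at the boundary, and the rest of the code can use `getOrElse` or pattern matching.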
[GitHub] spark pull request: [SPARK-5341] Use maven coordinates as dependen...
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/4215#discussion_r23879684 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala --- @@ -123,6 +126,7 @@ private[spark] class SparkSubmitArguments(args: Seq[String], env: Map[String, St .orNull name = Option(name).orElse(sparkProperties.get("spark.app.name")).orNull jars = Option(jars).orElse(sparkProperties.get("spark.jars")).orNull +ivyRepoPath = sparkProperties.get("spark.jars.ivy").orNull --- End diff -- Actually, that is in order to not expose it to users in spark-submit. I still wanted to have it as a configuration, just for the flexibility.
[GitHub] spark pull request: [SPARK-5473] [EC2] Expose SSH failures after s...
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/4262#issuecomment-72285208 I haven't tried out this solution, so I am not exactly sure what gets printed (I can do it over the weekend sometime). At a high level, my comment is that every attempt that checks if the cluster is `ssh-ready` should print some feedback on the screen so the user knows the script is not hung. If that is the case I'm fine with this solution.
[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-72265197 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26419/ Test FAILed.
[GitHub] spark pull request: SPARK-5425: Use synchronised methods in system...
Github user jacek-lewandowski commented on the pull request: https://github.com/apache/spark/pull/4220#issuecomment-72272628 Look at this simple example:
```scala
val parent = new Properties()
parent.setProperty("test1", "A")
val child = new Properties(parent)
child.put("test2", "B")
val copy = new Properties()
copy.putAll(child)
child.getProperty("test1")
child.getProperty("test2")
copy.getProperty("test1")
copy.getProperty("test2")
```
which will result in:
```
scala> res3: String = A
scala> res4: String = B
scala> res5: String = null
scala> res6: String = B
```
In other words: `new Properties(oldProperties)` initialises a new `Properties` object with oldProperties as its parent (defaults). On the other hand, `new Properties().putAll(oldProperties)` copies only those properties which were explicitly set and cuts off the whole hierarchy of defaults. Only cloning gives you an equivalent object.
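The difference between the three ways of duplicating a `Properties` object can be checked with a small self-contained sketch. The object name `PropertiesCopySketch` and its field names are made up for illustration. The upcast to `Hashtable` before calling `putAll` is only there because some Scala and JDK combinations report an overload ambiguity when `putAll` is called directly on `Properties`; it does not change what gets copied.

```scala
import java.util.{Hashtable, Properties}

object PropertiesCopySketch {
  val parent = new Properties()
  parent.setProperty("test1", "A")

  // `parent` becomes the defaults of `child`; it is not copied into child's own table.
  val child = new Properties(parent)
  child.setProperty("test2", "B")

  // putAll copies only the entries stored directly in `child`, so the defaults are lost.
  val copied = new Properties()
  val copiedAsTable: Hashtable[AnyRef, AnyRef] = copied
  copiedAsTable.putAll(child)

  // clone() keeps the reference to the defaults as well as the direct entries.
  val cloned: Properties = child.clone().asInstanceOf[Properties]
}
```

With this setup, `copied.getProperty("test1")` returns `null` while `cloned.getProperty("test1")` still resolves through the defaults, which is the point made in the comment above.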
[GitHub] spark pull request: [SPARK-5388] Provide a stable application subm...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/4216#discussion_r23876895 --- Diff: core/src/main/scala/org/apache/spark/deploy/rest/SubmitRestProtocolField.scala --- @@ -0,0 +1,30 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.deploy.rest + +/** + * A field used in [[SubmitRestProtocolMessage]]s. + */ +class SubmitRestProtocolField[T](val name: String) { --- End diff -- sounds fair. An earlier commit had exactly what you suggest here actually. I just thought if we wanted to do extra validation and throw a different exception then we could re-use the name here, but this is no longer the case.
[GitHub] spark pull request: [SPARK-5264][SQL] Support `drop temporary tabl...
Github user OopsOutOfMemory commented on a diff in the pull request: https://github.com/apache/spark/pull/4060#discussion_r23841357 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala --- @@ -231,6 +241,32 @@ private [sql] case class CreateTempTableUsing( } } +private[sql] case class DropTable( +tableName: String, +isExists: Boolean, +temporary: Boolean) extends Command --- End diff -- Hi @liancheng, I think we do need this logical `DropTable`. Every statement goes through ddlParser first, and only if that yields no plan is the dialect parser tried. If I remove this logical `DropTable`, executing `drop table xxx` will always use `DropTableCommand` in `SQLContext`. Sorry if I'm wrong. In HiveContext: else if (conf.dialect == "hiveql") { new SchemaRDD(this, ddlParser(sqlText, false).getOrElse(HiveQl.parseSql(substituted)))
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3976#discussion_r23864907 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala --- @@ -267,10 +277,22 @@ object SparkSubmit { // In yarn-cluster mode, use yarn.Client as a wrapper around the user class if (isYarnCluster) { childMainClass = "org.apache.spark.deploy.yarn.Client" - if (args.primaryResource != SPARK_INTERNAL) { -childArgs += ("--jar", args.primaryResource) + if (args.isPython) {// yarn-cluster mode for python application + val primaryResourceLocalPath = new Path(args.primaryResource) +childArgs += ("--primaryResource", primaryResourceLocalPath.getName) +val pyFilesLocalNames:String = if (args.pyFiles != null) { + args.pyFiles.split(",").map { p => (new Path(p)).getName }.mkString(",") --- End diff -- @lianhuiwang quick question: are you stripping the path prefix here because all the python files in YARN are already found in the working directory?
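For readers unfamiliar with the expression under discussion, the following sketch shows what "stripping the path prefix" does to a comma-delimited file list. It is only an approximation: it uses `java.nio.file.Paths` from the standard library instead of Hadoop's `Path` so that it runs without extra dependencies, and the object and method names (`PyFilesSketch`, `localNames`) are hypothetical.

```scala
import java.nio.file.Paths

object PyFilesSketch {
  // Reduce each entry of a comma-delimited path list to its final path component.
  // Assumes every entry names a file; a bare root such as "/" has no file name.
  def localNames(pyFiles: String): String =
    pyFiles.split(",").map(p => Paths.get(p.trim).getFileName.toString).mkString(",")
}
```

The result keeps only base names, which is sufficient only if the files are available under those names in the working directory of the process that consumes the list; that is exactly what the question above asks about.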
[GitHub] spark pull request: [SPARK-5341] Use maven coordinates as dependen...
Github user brkyvz commented on the pull request: https://github.com/apache/spark/pull/4215#issuecomment-72282474 @JoshRosen thank you very much for the time and comments. I'll fix things immediately
[GitHub] spark pull request: [SPARK-5388] Provide a stable application subm...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/4216#discussion_r23875609 --- Diff: core/src/main/scala/org/apache/spark/deploy/rest/SubmitRestProtocolField.scala --- @@ -0,0 +1,30 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.deploy.rest + +/** + * A field used in [[SubmitRestProtocolMessage]]s. + */ +class SubmitRestProtocolField[T](val name: String) { --- End diff -- I think you don't need `name` anymore -- you end up needing to repeat the field name a lot, when now Jackson is taking care of putting the field name in the JSON. Looks like it's only used in `assertFieldIsSet`, which is only called from `DriverStatusRequest` -- so you could just pass in a message for that one case and dry up a lot of the code.
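One possible shape of the suggestion is sketched below. This is not the code from the pull request: `RestFieldSketch`, `Field` and the two-argument `assertFieldIsSet` are hypothetical, and the real class may hold its value differently. The sketch only shows the idea of a field that no longer carries a name, with the one caller that needs a readable error supplying the message itself.

```scala
object RestFieldSketch {
  // Hypothetical simplified field: holds an optional value and no name.
  class Field[T] {
    private var value: Option[T] = None
    def isSet: Boolean = value.isDefined
    def setValue(v: T): Unit = { value = Some(v) }
    def getValue: T = value.getOrElse(throw new IllegalStateException("value not set"))
  }

  // The caller passes the error message, so the field does not need to know its own name.
  // `require` throws IllegalArgumentException when the field is unset.
  def assertFieldIsSet(field: Field[_], message: String): Unit =
    require(field.isSet, message)
}
```

A request class would then call something like `assertFieldIsSet(driverId, "The driver ID is missing")` in its own validation, where both the field and the message text are again only examples.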
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23876362 --- Diff: docs/mllib-clustering.md --- @@ -34,6 +34,26 @@ a given dataset, the algorithm returns the best clustering result). * *initializationSteps* determines the number of steps in the k-means\|\| algorithm. * *epsilon* determines the distance threshold within which we consider k-means to have converged. +### Power Iteration Clustering + +Power iteration clustering is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm: + +* accepts a [Graph](https://spark.apache.org/docs/0.9.2/api/graphx/index.html#org.apache.spark.graphx.Graph) that represents a normalized pairwise affinity between all input points. +* calculates the principal eigenvalue and eigenvector +* Clusters each of the input points according to their principal eigenvector component value + +Details of this algorithm are found within [Power Iteration Clustering, Lin and Cohen]{www.icml2010.org/papers/387.pdf} --- End diff -- This is not the correct syntax for links in markdown. Use `[](...)`
[GitHub] spark pull request: [SPARK-5341] Use maven coordinates as dependen...
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/4215#discussion_r23879289 --- Diff: core/src/test/scala/org/apache/spark/deploy/SparkSubmitUtilsSuite.scala --- @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.deploy + +import org.apache.spark.util.ResetSystemProperties +import org.scalatest.{Matchers, FunSuite} + +class SparkSubmitUtilsSuite extends FunSuite with Matchers with ResetSystemProperties { + + def beforeAll() { +System.setProperty("spark.testing", "true") --- End diff -- Carry-over from SparkSubmitSuite. It's my first time writing a core test; I can remove it if it's unnecessary.
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23876361 --- Diff: docs/mllib-clustering.md --- @@ -34,6 +34,26 @@ a given dataset, the algorithm returns the best clustering result). * *initializationSteps* determines the number of steps in the k-means\|\| algorithm. * *epsilon* determines the distance threshold within which we consider k-means to have converged. +### Power Iteration Clustering + +Power iteration clustering is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm: + +* accepts a [Graph](https://spark.apache.org/docs/0.9.2/api/graphx/index.html#org.apache.spark.graphx.Graph) that represents a normalized pairwise affinity between all input points. --- End diff -- This should use a relative path under api/graphx/. See the examples in this markdown file.
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user mccheah commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-72286220 Tried running the Streaming CheckpointSuite locally, and it broke because of the new CommitDeniedException logic I added. Don't have any ideas as to how this happens except that streaming might not be using SparkHadoopWriter in a way that is compatible with this design, perhaps... I don't think I'll be able to take this any further. Feel free to pick things up from here, @JoshRosen.
[GitHub] spark pull request: SPARK-5425: Use synchronised methods in system...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/4220#issuecomment-72286197 @JoshRosen who investigated this a bunch for tests
[GitHub] spark pull request: [SPARK-5473] [EC2] Expose SSH failures after s...
Github user nchammas commented on the pull request: https://github.com/apache/spark/pull/4262#issuecomment-72286941 You can see example output in [the PR description](https://github.com/apache/spark/pull/4262#issue-55856344). I will look into adding feedback while the script is waiting on the cluster to reach a certain state.
[GitHub] spark pull request: SPARK-5425: Use synchronised methods in system...
Github user jacek-lewandowski commented on the pull request: https://github.com/apache/spark/pull/4220#issuecomment-72268328 @srowen - unfortunately they are something more - they inherit from `Hashtable`, but they also form a hierarchy by referencing a parent `Properties` object that holds the defaults. As the defaults object is also a `Properties`, it can have its own parent, and so on.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/2676#issuecomment-72223705 I'll try to bring it up to date today. I'm out all next week though so if you find issues someone else might need to take it over.
[GitHub] spark pull request: [SPARK-5341] Use maven coordinates as dependen...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4215#discussion_r23879094 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala --- @@ -431,6 +458,155 @@ object SparkSubmit { } } +/** Provides utility functions to be used inside SparkSubmit. */ +private[spark] object SparkSubmitUtils extends Logging { + + // Directories for caching downloads through ivy and storing the jars when maven coordinates are + // supplied to spark-submit + private var PACKAGES_DIRECTORY: File = null + + /** + * Represents a Maven Coordinate + * @param groupId the groupId of the coordinate + * @param artifactId the artifactId of the coordinate + * @param version the version of the coordinate + */ + private[spark] case class MavenCoordinate(groupId: String, artifactId: String, version: String) + + /** + * Resolves any dependencies that were supplied through maven coordinates + * @param coordinates Comma-delimited string of maven coordinates + * @param remoteRepos Comma-delimited string of remote repositories other than maven central + * @param ivyPath The path to the local ivy repository + * @return The comma-delimited path to the jars of the given maven artifacts including their + * transitive dependencies + */ + private[spark] def resolveMavenCoordinates( + coordinates: String, + remoteRepos: String, + ivyPath: String, + isTest: Boolean = false): String = { +if (coordinates == null || coordinates.trim.isEmpty) { + +} else { + val artifacts = coordinates.split(,).map { p = +val splits = p.split(:) +require(splits.length == 3, sProvided Maven Coordinates must be in the form + + s'groupId:artifactId:version'. The coordinate provided is: $p) +require(splits(0) != null splits(0).trim.nonEmpty, sThe groupId cannot be null or + + sbe whitespace. The groupId provided is: ${splits(0)}) +require(splits(1) != null splits(1).trim.nonEmpty, sThe artifactId cannot be null or + + sbe whitespace. 
The artifactId provided is: ${splits(1)}) +require(splits(2) != null splits(2).trim.nonEmpty, sThe version cannot be null or + + sbe whitespace. The version provided is: ${splits(2)}) +new MavenCoordinate(splits(0), splits(1), splits(2)) + } + // Default configuration name for ivy + val conf = default + // set ivy settings for location of cache + val ivySettings: IvySettings = new IvySettings + if (ivyPath == null || ivyPath.trim.isEmpty) { +PACKAGES_DIRECTORY = new File(ivySettings.getDefaultIvyUserDir, jars) + } else { +ivySettings.setDefaultCache(new File(ivyPath, cache)) +PACKAGES_DIRECTORY = new File(ivyPath, jars) + } + logInfo(sIvy Default Cache set to: ${ivySettings.getDefaultCache.getAbsolutePath}) + logInfo(sThe jars for the packages stored in: $PACKAGES_DIRECTORY) + + // create a pattern matcher + ivySettings.addMatcher(new GlobPatternMatcher) + + // the biblio resolver resolves POM declared dependencies + val br: IBiblioResolver = new IBiblioResolver + br.setM2compatible(true) + br.setUsepoms(true) + br.setName(central) + + // We need a chain resolver if we want to check multiple repositories + val cr = new ChainResolver + cr.setName(list) + cr.add(br) + + // Add an exclusion rule for Spark + val sparkArtifacts = new ArtifactId(new ModuleId(org.apache.spark, *), *, *, *) + val sparkDependencyExcludeRule = +new DefaultExcludeRule(sparkArtifacts, ivySettings.getMatcher(glob), null) + sparkDependencyExcludeRule.addConfiguration(conf) + + // add any other remote repositories other than maven central + if (remoteRepos != null remoteRepos.trim.nonEmpty) { +var i = 1 +remoteRepos.split(,).foreach { repo = + val brr: IBiblioResolver = new IBiblioResolver + brr.setM2compatible(true) + brr.setUsepoms(true) + brr.setRoot(repo) + brr.setName(srepo-$i) + cr.add(brr) + logInfo(s$repo added as a remote repository with the name: ${brr.getName}) + i += 1 +} + } + ivySettings.addResolver(cr) + ivySettings.setDefaultResolver(cr.getName) + val ivy = 
Ivy.newInstance(ivySettings) + // Set resolve options to download transitive dependencies as well + val resolveOptions = new ResolveOptions +
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3519#issuecomment-72198487 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26400/ Test FAILed.
[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...
Github user avulanov commented on a diff in the pull request: https://github.com/apache/spark/pull/1484#discussion_r23861260 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -0,0 +1,86 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.feature + +import org.apache.spark.annotation.Experimental +import org.apache.spark.mllib.linalg +import org.apache.spark.mllib.linalg.{Vectors, Vector} +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.stat.Statistics +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * Chi Squared selector model. + * + * @param indices list of indices to select (filter) + */ +@Experimental +class ChiSqSelectorModel(indices: IndexedSeq[Int]) extends VectorTransformer { + /** + * Applies transformation on a vector. + * + * @param vector vector to be transformed. + * @return transformed vector. + */ + override def transform(vector: linalg.Vector): linalg.Vector = { +Compress(vector, indices) + } +} + +/** + * :: Experimental :: + * Creates a ChiSquared feature selector. + */ +@Experimental +object ChiSqSelector { --- End diff -- Done! 
However, why do you think it is better than having a static function, given that this class does nothing but store an integer (same for IDF)?
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3976#issuecomment-72240509 This looks like the right approach. Added some comments inline. Are you able to add a test for this in `YarnClusterSuite`? Also, one last small thing: in `PythonRunner` are you able to remove the reference to Spark submit in the header comment, as this is now used in a more general way?
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3519#issuecomment-72198477 [Test build #26400 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26400/consoleFull) for PR 3519 at commit [`e60a34f`](https://github.com/apache/spark/commit/e60a34f3479ce3b642f5941497c3e5c1bbeebdd4). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class IsotonicRegressionModel (`
[GitHub] spark pull request: [SPARK-5504] [sql] convertToCatalyst should su...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4295#issuecomment-72280766 [Test build #26429 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26429/consoleFull) for PR 4295 at commit [`6b7276d`](https://github.com/apache/spark/commit/6b7276d44d0d578545f5c543de4167c0569fe4e1). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3976#discussion_r23861241 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala --- @@ -138,8 +140,9 @@ object SparkSubmit { (clusterManager, deployMode) match { case (MESOS, CLUSTER) => printErrorAndExit("Cluster deploy mode is currently not supported for Mesos clusters.") - case (_, CLUSTER) if args.isPython => -printErrorAndExit("Cluster deploy mode is currently not supported for python applications.") + case (STANDALONE, CLUSTER) if args.isPython => +printErrorAndExit("Standalone-Cluster deploy mode is currently not supported" + --- End diff -- Yes, please try to be consistent with the other error messages here.
[GitHub] spark pull request: SPARK-5425: Use synchronised methods in system...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/4220#issuecomment-72273213 I see, makes sense, thanks for the details.
[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-72265084 [Test build #26419 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26419/consoleFull) for PR 4047 at commit [`6fd1f71`](https://github.com/apache/spark/commit/6fd1f718ccef2464601256a84e99523c1f7d033f). * This patch merges cleanly.
[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1484#discussion_r23877192
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors, Vector}
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.stat.Statistics
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: Experimental ::
+ * Chi Squared selector model.
+ *
+ * @param indices list of indices to select (filter)
+ */
+@Experimental
+class ChiSqSelectorModel(indices: Array[Int]) extends VectorTransformer {
+  /**
+   * Applies transformation on a vector.
+   *
+   * @param vector vector to be transformed.
+   * @return transformed vector.
+   */
+  override def transform(vector: Vector): Vector = {
+    Compress(vector, indices)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Creates a ChiSquared feature selector.
+ * @param numTopFeatures number of features that selector will select
+ *                       (ordered by statistic value descending)
+ */
+@Experimental
+class ChiSqSelector (val numTopFeatures: Int) {
+
+  /**
+   * Returns a ChiSquared feature selector.
+   *
+   * @param data data used to compute the Chi Squared statistic.
+   */
+  def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
+    val indices = Statistics.chiSqTest(data)
+      .zipWithIndex.sortBy { case(res, _) => -res.statistic }
+      .take(numTopFeatures)
+      .map{ case(_, indices) => indices }
+    new ChiSqSelectorModel(indices)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Filters features in a given vector
+ */
+@Experimental
+object Compress {
+  /**
+   * Returns a vector with features filtered.
+   * Preserves the order of filtered features the same as their indices are stored.
+   * @param features vector
+   * @param filterIndices indices of features to filter
+   */
+  def apply(features: Vector, filterIndices: Array[Int]): Vector = {
+    features match {
+      case SparseVector(size, indices, values) =>
+        val filterMap = filterIndices.zipWithIndex.toMap
--- End diff --
This is slow due to hash map creation and hash lookups. Since both arrays are ordered, we can use the one-catch-another approach to extract indices, for example, https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala#L344 Btw, please use `ArrayBuilder` to build new index/value arrays, which doesn't have the boxing/unboxing issues.
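The one-catch-another approach suggested in the review comment above can be sketched as a two-pointer walk over the two sorted index arrays. This is only an illustrative sketch, not the code from the PR or from `Vectors.scala`: `SparseFilterSketch` and `filterSparse` are hypothetical names, both index arrays are assumed to be sorted in ascending order, and the specialized `ArrayBuilder.ofInt` and `ArrayBuilder.ofDouble` builders are used so that no boxing takes place.

```scala
import scala.collection.mutable.ArrayBuilder

// Hypothetical helper, not part of Spark: filters a sparse vector given as
// parallel (indices, values) arrays without building a hash map.
object SparseFilterSketch {
  /**
   * Keeps the entries whose index appears in filterIndices.
   * Assumption: indices and filterIndices are both sorted ascending.
   * The returned indices are positions within the filtered vector.
   */
  def filterSparse(
      indices: Array[Int],
      values: Array[Double],
      filterIndices: Array[Int]): (Array[Int], Array[Double]) = {
    // Primitive-specialized builders avoid boxing/unboxing.
    val newIndices = new ArrayBuilder.ofInt
    val newValues = new ArrayBuilder.ofDouble
    var i = 0 // position in the sparse vector's index array
    var j = 0 // position in filterIndices
    while (i < indices.length && j < filterIndices.length) {
      if (indices(i) == filterIndices(j)) {
        // Match: the feature is selected and becomes feature j of the result.
        newIndices += j
        newValues += values(i)
        i += 1
        j += 1
      } else if (indices(i) < filterIndices(j)) {
        i += 1 // this stored entry is not selected
      } else {
        j += 1 // this selected feature is an implicit zero in the input
      }
    }
    (newIndices.result(), newValues.result())
  }
}
```

For example, `SparseFilterSketch.filterSparse(Array(0, 2, 5, 7), Array(1.0, 2.0, 3.0, 4.0), Array(2, 3, 7))` keeps the entries stored at original indices 2 and 7 and re-indexes them as 0 and 2 in the filtered vector. Each array is traversed once, so the cost is linear in the combined length of the two index arrays.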
[GitHub] spark pull request: [SPARK-3975] Added support for BlockMatrix add...
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/4274#discussion_r23862043
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala ---
@@ -237,4 +239,88 @@ class BlockMatrix(
     val localMat = toLocalMatrix()
     new BDM[Double](localMat.numRows, localMat.numCols, localMat.toArray)
   }
+
+  /** Adds two block matrices together. The matrices must have the same size and matching
+    * `rowsPerBlock` and `colsPerBlock` values. */
+  def add(other: BlockMatrix): BlockMatrix = {
+    require(numRows() == other.numRows(), "Both matrices must have the same number of rows. " +
+      s"A.numRows: ${numRows()}, B.numRows: ${other.numRows()}")
+    require(numCols() == other.numCols(), "Both matrices must have the same number of columns. " +
+      s"A.numCols: ${numCols()}, B.numCols: ${other.numCols()}")
+    if (checkPartitioning(other, OperationNames.add)) {
+      val addedBlocks = blocks.cogroup(other.blocks, partitioner).
+        map { case ((blockRowIndex, blockColIndex), (a, b)) =>
+          if (a.isEmpty) {
+            new MatrixBlock((blockRowIndex, blockColIndex), b.head)
+          } else if (b.isEmpty) {
+            new MatrixBlock((blockRowIndex, blockColIndex), a.head)
+          } else {
+            val result = a.head.toBreeze + b.head.toBreeze
+            new MatrixBlock((blockRowIndex, blockColIndex), Matrices.fromBreeze(result))
+          }
+      }
+      new BlockMatrix(addedBlocks, rowsPerBlock, colsPerBlock, numRows(), numCols())
+    } else {
+      throw new SparkException(
+        "Cannot add matrices with non-matching partitioners")
+    }
+  }
+
+  /** Left multiplies this [[BlockMatrix]] to `other`, another [[BlockMatrix]]. The `colsPerBlock`
+    * of this matrix must equal the `rowsPerBlock` of `other`. If `other` contains
+    * [[SparseMatrix]], they will have to be converted to a
+    * [[DenseMatrix]]. This may cause some performance issues until support for multiplying
+    * two sparse matrices is added.
+    */
+  def multiply(other: BlockMatrix): BlockMatrix = {
+    require(numCols() == other.numRows(), "The number of columns of A and the number of rows " +
+      s"of B must be equal. A.numCols: ${numCols()}, B.numRows: ${other.numRows()}. If you " +
+      s"think they should be equal, try setting the dimensions of A and B explicitly while " +
+      s"initializing them.")
+    if (checkPartitioning(other, OperationNames.multiply)) {
+      val resultPartitioner = GridPartitioner(numRowBlocks, other.numColBlocks,
+        math.min(partitioner.numPartitions, other.partitioner.numPartitions))
+      // Each block of A must be multiplied with the corresponding blocks in each column of B.
+      val flatA = blocks.flatMap{ case ((blockRowIndex, blockColIndex), block) =>
+        Array.tabulate(other.numColBlocks)(j => ((blockRowIndex, j, blockColIndex), block))
+      }
+      // Each block of B must be multiplied with the corresponding blocks in each row of A.
+      val flatB = other.blocks.flatMap{ case ((blockRowIndex, blockColIndex), block) =>
+        Array.tabulate(numRowBlocks)(i => ((i, blockColIndex, blockRowIndex), block))
+      }
+      val newBlocks: RDD[MatrixBlock] = flatA.join(flatB, resultPartitioner).
+        map { case ((blockRowIndex, blockColIndex, _), (mat1, mat2)) =>
+          val C = mat2 match {
+            case dense: DenseMatrix => mat1.multiply(dense)
+            case sparse: SparseMatrix => mat1.multiply(sparse.toDense())
+            case _ => throw new SparkException(s"Unrecognized matrix type ${mat2.getClass}.")
+          }
+          ((blockRowIndex, blockColIndex), C.toBreeze)
+      }.reduceByKey(resultPartitioner, (a, b) => a + b).mapValues(Matrices.fromBreeze)
--- End diff --
The only problem I see there is that we need to know whether the block is on the right or bottom edge to properly initialize a `ZeroValue`.
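The keying scheme used by `multiply` in the diff above (replicate every block under an `(i, j, k)` key, join, multiply, then sum per `(i, j)`) may be easier to follow on plain Scala collections. The sketch below is a simplification under stated assumptions, not Spark code: `BlockMultiplyKeysSketch` is a hypothetical name, each block is reduced to a single `Double` so that block multiplication becomes an ordinary product, and an in-memory `Map` stands in for the RDD join and `reduceByKey`.

```scala
// Hypothetical, Spark-free illustration of the (i, j, k) keying scheme.
object BlockMultiplyKeysSketch {
  // Stand-in for a matrix block; a real block would be a local matrix.
  type Block = Double

  def multiply(
      a: Map[(Int, Int), Block], numRowBlocksA: Int,
      b: Map[(Int, Int), Block], numColBlocksB: Int): Map[(Int, Int), Block] = {
    // Replicate each block A(i, k) once per column block j of B, keyed by (i, j, k).
    val flatA = a.toSeq.flatMap { case ((i, k), block) =>
      Seq.tabulate(numColBlocksB)(j => ((i, j, k), block))
    }
    // Replicate each block B(k, j) once per row block i of A, keyed by (i, j, k).
    val flatB = b.toSeq.flatMap { case ((k, j), block) =>
      Seq.tabulate(numRowBlocksA)(i => ((i, j, k), block))
    }.toMap
    // "Join" on (i, j, k): only keys present on both sides produce a partial product.
    val products = flatA.collect {
      case (key @ (i, j, _), blockA) if flatB.contains(key) =>
        ((i, j), blockA * flatB(key))
    }
    // Sum the partial products that belong to the same result block (i, j).
    products.groupBy(_._1).map { case (ij, ps) => ij -> ps.map(_._2).sum }
  }
}
```

Because only keys present on both sides survive the join, a result block with no contributing pair is simply absent from the output, which is the situation where a zero block of the right shape would have to be supplied.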
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4254#issuecomment-72277872 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26423/ Test PASSed.
[GitHub] spark pull request: [SPARK-5341] Use maven coordinates as dependen...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/4215#issuecomment-72283292 We should probably mention this feature in the Submitting Applications section of the docs: https://spark.apache.org/docs/latest/submitting-applications.html
[GitHub] spark pull request: SPARK-5400 [MLlib] Changed name of GaussianMix...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4290#issuecomment-72246727 [Test build #26412 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26412/consoleFull) for PR 4290 at commit [`9c1534c`](https://github.com/apache/spark/commit/9c1534cd1c37953c1c592a2ce419eaee68dd853c). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3431] [WIP] Parallelize Scala/Java test...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3564#issuecomment-72281399 [Test build #26430 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26430/consoleFull) for PR 3564 at commit [`5ef856d`](https://github.com/apache/spark/commit/5ef856d9e4a6a4eb7d04a3f999a27a41618b1fd9). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4969][STREAMING][PYTHON] Add binaryReco...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3803#issuecomment-72286100 [Test build #26435 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26435/consoleFull) for PR 3803 at commit [`9a3715a`](https://github.com/apache/spark/commit/9a3715a1e6a71040d234da52bf848b0bb109a591). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-72278827 [Test build #26428 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26428/consoleFull) for PR 4155 at commit [`594e41a`](https://github.com/apache/spark/commit/594e41abecf5a48084608ab20112f884f28fc920). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4001][MLlib] adding parallel FP-Growth ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2847#issuecomment-72231925 [Test build #26406 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26406/consoleFull) for PR 2847 at commit [`93f3280`](https://github.com/apache/spark/commit/93f3280fb1b9897f40b695683824aef619a5b8c2). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4001][MLlib] adding parallel FP-Growth ...
Github user jackylk commented on the pull request: https://github.com/apache/spark/pull/2847#issuecomment-72232107 @mengxr I have modified according to the comments, please review
[GitHub] spark pull request: [SPARK-5504] [sql] convertToCatalyst should su...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4295#issuecomment-72280446 [Test build #573 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/573/consoleFull) for PR 4295 at commit [`6b7276d`](https://github.com/apache/spark/commit/6b7276d44d0d578545f5c543de4167c0569fe4e1). * This patch merges cleanly.
[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1484#issuecomment-72281109 @avulanov Please check my inline comments on Compress and the Estimator/Model.
[GitHub] spark pull request: [SPARK-4001][MLlib] adding parallel FP-Growth ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2847#issuecomment-72232096 [Test build #26406 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26406/consoleFull) for PR 2847 at commit [`93f3280`](https://github.com/apache/spark/commit/93f3280fb1b9897f40b695683824aef619a5b8c2). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class FPGrowthModel (val frequentPattern: Array[(Array[String], Long)]) extends Serializable ` * `class FPTree extends Serializable ` * `class FPTreeNode(val item: String, var count: Int) extends Serializable `
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3519#issuecomment-72283444 [Test build #26433 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26433/consoleFull) for PR 3519 at commit [`ded071c`](https://github.com/apache/spark/commit/ded071c51d0669eaedee062692c9accf13233c18). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5504] [sql] convertToCatalyst should su...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/4295#issuecomment-72283313 Awesome, thanks. LGTM
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3976#discussion_r23858127
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -267,10 +277,22 @@ object SparkSubmit {
     // In yarn-cluster mode, use yarn.Client as a wrapper around the user class
     if (isYarnCluster) {
       childMainClass = "org.apache.spark.deploy.yarn.Client"
-      if (args.primaryResource != SPARK_INTERNAL) {
-        childArgs += ("--jar", args.primaryResource)
+      if (args.isPython) {// yarn-cluster mode for python application
+        val primaryResourceLocalPath = new Path(args.primaryResource)
+        childArgs += ("--primaryResource", primaryResourceLocalPath.getName)
+        val pyFilesLocalNames:String = if (args.pyFiles != null) {
+          args.pyFiles.split(",").map { p => (new Path(p)).getName }.mkString(",")
+        } else {
+          null
+        }
+        childArgs += ("--py-files", pyFilesLocalNames.toString)
--- End diff --
No need for `toString`, this is already a string. Also, can we avoid adding this arg at all instead of using null?
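One way to act on the second point of the review comment above is to wrap the possibly-null value in `Option` and append the flag only when a value is present. The sketch below is an illustration under assumptions, not the change that was made in the PR: `PyFilesArgSketch`, `buildChildArgs` and `baseName` are hypothetical names, and a plain string operation replaces Hadoop's `Path.getName` so that the example is self-contained.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical, self-contained illustration; not the SparkSubmit code.
object PyFilesArgSketch {
  /** Returns the last path component (stand-in for Hadoop's Path.getName). */
  private def baseName(path: String): String =
    path.substring(path.lastIndexOf('/') + 1)

  /** Builds child arguments, adding --py-files only when pyFiles is set. */
  def buildChildArgs(primaryResource: String, pyFiles: String): List[String] = {
    val childArgs = ArrayBuffer[String]()
    childArgs ++= Seq("--primaryResource", baseName(primaryResource))
    // Option(null) is None, so nothing is appended and no null reaches the args.
    Option(pyFiles).filter(_.nonEmpty).foreach { files =>
      val localNames = files.split(",").map(baseName).mkString(",")
      childArgs ++= Seq("--py-files", localNames)
    }
    childArgs.toList
  }
}
```

With this shape, `buildChildArgs("/tmp/app.py", null)` yields only the `--primaryResource` pair, and the `toString` call becomes unnecessary because `localNames` is already a `String`.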
[GitHub] spark pull request: [SPARK-5341] Use maven coordinates as dependen...
Github user brkyvz commented on the pull request: https://github.com/apache/spark/pull/4215#issuecomment-72283499 I will add documentation during the QA period
[GitHub] spark pull request: [SPARK-5341] Use maven coordinates as dependen...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4215#discussion_r23879255
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -431,6 +458,155 @@ object SparkSubmit {
   }
 }
 
+/** Provides utility functions to be used inside SparkSubmit. */
+private[spark] object SparkSubmitUtils extends Logging {
+
+  // Directories for caching downloads through ivy and storing the jars when maven coordinates are
+  // supplied to spark-submit
+  private var PACKAGES_DIRECTORY: File = null
+
+  /**
+   * Represents a Maven Coordinate
+   * @param groupId the groupId of the coordinate
+   * @param artifactId the artifactId of the coordinate
+   * @param version the version of the coordinate
+   */
+  private[spark] case class MavenCoordinate(groupId: String, artifactId: String, version: String)
+
+  /**
+   * Resolves any dependencies that were supplied through maven coordinates
+   * @param coordinates Comma-delimited string of maven coordinates
+   * @param remoteRepos Comma-delimited string of remote repositories other than maven central
+   * @param ivyPath The path to the local ivy repository
+   * @return The comma-delimited path to the jars of the given maven artifacts including their
+   *         transitive dependencies
+   */
+  private[spark] def resolveMavenCoordinates(
+      coordinates: String,
+      remoteRepos: String,
+      ivyPath: String,
+      isTest: Boolean = false): String = {
+    if (coordinates == null || coordinates.trim.isEmpty) {
+      ""
+    } else {
+      val artifacts = coordinates.split(",").map { p =>
+        val splits = p.split(":")
+        require(splits.length == 3, s"Provided Maven Coordinates must be in the form " +
+          s"'groupId:artifactId:version'. The coordinate provided is: $p")
+        require(splits(0) != null && splits(0).trim.nonEmpty, s"The groupId cannot be null or " +
+          s"be whitespace. The groupId provided is: ${splits(0)}")
+        require(splits(1) != null && splits(1).trim.nonEmpty, s"The artifactId cannot be null or " +
+          s"be whitespace. The artifactId provided is: ${splits(1)}")
+        require(splits(2) != null && splits(2).trim.nonEmpty, s"The version cannot be null or " +
+          s"be whitespace. The version provided is: ${splits(2)}")
+        new MavenCoordinate(splits(0), splits(1), splits(2))
+      }
+      // Default configuration name for ivy
+      val conf = "default"
--- End diff --
Could you rename this to something like `ivyConfName`? By itself, `conf` is kind of ambiguous to me.
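The coordinate validation in the diff above can be tried in isolation with a small standalone sketch. This is an approximation under assumptions rather than the PR's code: `MavenCoordinateSketch` and `parse` are hypothetical names, the three per-field checks are folded into a single `require`, and the Ivy resolution step is left out entirely.

```scala
// Hypothetical, self-contained illustration of parsing
// "groupId:artifactId:version" coordinates; no Ivy resolution is performed.
object MavenCoordinateSketch {
  case class MavenCoordinate(groupId: String, artifactId: String, version: String)

  /** Parses a comma-delimited list of coordinates; null or blank input gives an empty list. */
  def parse(coordinates: String): Seq[MavenCoordinate] = {
    if (coordinates == null || coordinates.trim.isEmpty) {
      Seq.empty
    } else {
      coordinates.split(",").toSeq.map { p =>
        val splits = p.split(":")
        require(splits.length == 3, "Provided Maven Coordinates must be in the form " +
          s"'groupId:artifactId:version'. The coordinate provided is: $p")
        // Every part must contain something other than whitespace.
        require(splits.forall(_.trim.nonEmpty), "The groupId, artifactId and version " +
          s"cannot be whitespace. The coordinate provided is: $p")
        MavenCoordinate(splits(0), splits(1), splits(2))
      }
    }
  }
}
```

A call such as `MavenCoordinateSketch.parse("org.apache.spark:spark-core_2.10:1.2.0")` returns a single `MavenCoordinate`, while a malformed entry such as `"groupOnly:artifact"` fails the first `require` with an `IllegalArgumentException`.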
[GitHub] spark pull request: [SPARK-4969][STREAMING][PYTHON] Add binaryReco...
Github user freeman-lab commented on the pull request: https://github.com/apache/spark/pull/3803#issuecomment-72285668 @JoshRosen I finished the refactored tests and added better handling of the `getBytes` based on your suggestion.
[GitHub] spark pull request: remove redundant field childOutput from exec...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/4291#issuecomment-72240710 Hi Kai, mind tagging this [SQL] so it can get properly sorted?
[GitHub] spark pull request: [SPARK-5388] Provide a stable application subm...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4216#issuecomment-72247522 [Test build #26413 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26413/consoleFull) for PR 4216 at commit [`bf696ff`](https://github.com/apache/spark/commit/bf696ff0b7135883e53e5fb275b4afa0db6c4a4a). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-72265194 [Test build #26419 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26419/consoleFull) for PR 4047 at commit [`6fd1f71`](https://github.com/apache/spark/commit/6fd1f718ccef2464601256a84e99523c1f7d033f). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class Tokenizer(sc: SparkContext, stopwordFile: String) extends Serializable ` * ` class EMOptimizer(`
[GitHub] spark pull request: [SPARK-5486] Added validate method to BlockMat...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4279
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4254#issuecomment-72265723 [Test build #26420 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26420/consoleFull) for PR 4254 at commit [`f292f31`](https://github.com/apache/spark/commit/f292f31309201ed01186a221824675bed84a6f17). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-72265733 [Test build #26421 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26421/consoleFull) for PR 4047 at commit [`1db89e2`](https://github.com/apache/spark/commit/1db89e2964aee34f9c33300be271dde41d61a782). * This patch merges cleanly.