[GitHub] spark pull request: [SPARK-5478][UI][Minor] Add missing right pare...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4267 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-5492. Thread statistics can break with o...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4305#issuecomment-72417193 LGTM pending tests. Thanks, Sandy
[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-72416892 @shivaram In the weakly typed API, fit() will take a DataFrame (containing attributes info) + Params. In the strongly typed API, train() would take an RDD[LabeledPoint], separate attributes info, + Params. Since the weakly typed API takes Params, it would be best not to duplicate the attributes info in the Params.
[GitHub] spark pull request: [SPARK-5212][SQL] Add support of schema-less, ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/4014#discussion_r23910289

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/ScriptTransformation.scala ---
@@ -25,9 +25,18 @@ import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
  * @param input the set of expression that should be passed to the script.
  * @param script the command that should be executed.
  * @param output the attributes that are produced by the script.
+ * @param ioschema the input and output schema applied in the execution of the script.
  */
 case class ScriptTransformation(
     input: Seq[Expression],
     script: String,
     output: Seq[Attribute],
-    child: LogicalPlan) extends UnaryNode
+    child: LogicalPlan,
+    ioschema: Option[ScriptInputOutputSchema]) extends UnaryNode
--- End diff --

In the Hive case, it is not. But I think it may be for other cases?
[GitHub] spark pull request: SPARK-4687. [WIP] Add an addDirectory API
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3670#issuecomment-72416125 [Test build #26496 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26496/consoleFull) for PR 3670 at commit [`21504f9`](https://github.com/apache/spark/commit/21504f9381fc7c73486dfdbc51be023023213e91). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-5492. Thread statistics can break with o...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4305#issuecomment-72416110 [Test build #26495 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26495/consoleFull) for PR 4305 at commit [`b7d4497`](https://github.com/apache/spark/commit/b7d4497cf3a62d5c289a5b0e31148619162d2e14). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5212][SQL] Add support of schema-less, ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/4014#discussion_r23910154

--- Diff: sql/hive/v0.12.0/src/main/scala/org/apache/spark/sql/hive/Shim12.scala ---
@@ -241,8 +241,14 @@ private[hive] object HiveShim {
       Decimal(hdoi.getPrimitiveJavaObject(data).bigDecimalValue())
     }
   }
+
+  implicit def prepareWritable(shimW: ShimWritable): Writable = {
+    shimW.writable
+  }
 }
+
+case class ShimWritable(writable: Writable)
--- End diff --

If we skip `ShimWritable`, we then need to remove `implicit` from `prepareWritable` and call it explicitly to do the fixing. Is that better? If so, I can do it that way. It does not break Hive 0.12 because we just pass the underlying writable object through without touching it. We only do the fixing on Hive 0.13.
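The pattern under discussion can be sketched outside of Scala and Hive entirely. The following is a hypothetical Python illustration (not the actual Spark shim code, and `prepare_writable` is an invented name): a version-gated "prepare" step that passes the object through untouched on the old version and applies a fix only on the new one, which is why the old code path cannot break.

```python
# Hypothetical sketch of a version-gated shim: a no-op pass-through for the
# old version, a fix applied only for the new one. Names and the dict-based
# "writable" stand-in are illustrative assumptions, not Spark APIs.
def prepare_writable(writable, hive_version):
    if hive_version == "0.12":
        return writable            # pass-through: old behavior is untouched
    fixed = dict(writable)         # model "fixing" as a defensive copy + patch
    fixed["fixed"] = True
    return fixed

w = {"payload": b"row-bytes"}
assert prepare_writable(w, "0.12") is w       # identical object on 0.12
assert prepare_writable(w, "0.13")["fixed"]   # fix applied on 0.13
assert "fixed" not in w                       # original never mutated
```

The same trade-off raised in the comment applies here: an implicit conversion hides the call site, while an explicit call makes the version-dependent fix visible to readers.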
[GitHub] spark pull request: SPARK-5492. Thread statistics can break with o...
GitHub user sryza opened a pull request: https://github.com/apache/spark/pull/4305 SPARK-5492. Thread statistics can break with older Hadoop versions

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sryza/spark sandy-spark-5492

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4305.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4305

commit b7d4497cf3a62d5c289a5b0e31148619162d2e14
Author: Sandy Ryza
Date: 2015-02-02T07:29:27Z

    SPARK-5492. Thread statistics can break with older Hadoop versions
[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/4233#discussion_r23910101

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
       case None => score
     }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Create JSON metadata.
+    val metadata = LogisticRegressionModel.Metadata(
+      clazz = this.getClass.getName, version = Exportable.latestVersion)
+    val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
+    metadataRDD.toJSON.saveAsTextFile(path + "/metadata")
+    // Create Parquet data.
+    val data = LogisticRegressionModel.Data(weights, intercept, threshold)
+    val dataRDD: DataFrame = sc.parallelize(Seq(data))
+    dataRDD.saveAsParquetFile(path + "/data")
+  }
+}
+
+object LogisticRegressionModel extends Importable[LogisticRegressionModel] {
+
+  private case class Metadata(clazz: String, version: String)
+
+  private case class Data(weights: Vector, intercept: Double, threshold: Option[Double])
+
+  override def load(sc: SparkContext, path: String): LogisticRegressionModel = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Load JSON metadata.
+    val metadataRDD = sqlContext.jsonFile(path + "/metadata")
--- End diff --

(I guess these are conflicting since using DataFrame and toJSON will mean 1 record per text file line, but that's OK with me.)
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3976#issuecomment-72415530 [Test build #26490 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26490/consoleFull) for PR 3976 at commit [`67f8cee`](https://github.com/apache/spark/commit/67f8cee9e25b5bd05c0252705b1f67cb63b0fa01). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3976#issuecomment-72415533 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26490/
[GitHub] spark pull request: [SPARK-5512][Mllib] Run the PIC algorithm with...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4301#issuecomment-72415114 [Test build #26494 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26494/consoleFull) for PR 4301 at commit [`19cf94e`](https://github.com/apache/spark/commit/19cf94ecfd6d879cbceb52f0abc0a32461e7d871). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5512][Mllib] Run the PIC algorithm with...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/4301#issuecomment-72414811 @mengxr I think it is better to keep both and leave it as an option users can switch between.
[GitHub] spark pull request: [SPARK-5341] Use maven coordinates as dependen...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4215#issuecomment-72414442 [Test build #26493 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26493/consoleFull) for PR 4215 at commit [`c08dc9f`](https://github.com/apache/spark/commit/c08dc9fb8d85a7d9a58f980af99687e13c4d766a). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5341] Use maven coordinates as dependen...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4215#issuecomment-72414128 [Test build #26492 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26492/consoleFull) for PR 4215 at commit [`3ada19a`](https://github.com/apache/spark/commit/3ada19ac3d569e4d5af35c309436be36ba211f94). * This patch **does not merge cleanly**.
[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/4233#discussion_r23909298

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
       case None => score
     }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Create JSON metadata.
+    val metadata = LogisticRegressionModel.Metadata(
+      clazz = this.getClass.getName, version = Exportable.latestVersion)
+    val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
+    metadataRDD.toJSON.saveAsTextFile(path + "/metadata")
+    // Create Parquet data.
+    val data = LogisticRegressionModel.Data(weights, intercept, threshold)
+    val dataRDD: DataFrame = sc.parallelize(Seq(data))
+    dataRDD.saveAsParquetFile(path + "/data")
+  }
+}
+
+object LogisticRegressionModel extends Importable[LogisticRegressionModel] {
+
+  private case class Metadata(clazz: String, version: String)
+
+  private case class Data(weights: Vector, intercept: Double, threshold: Option[Double])
+
+  override def load(sc: SparkContext, path: String): LogisticRegressionModel = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Load JSON metadata.
+    val metadataRDD = sqlContext.jsonFile(path + "/metadata")
--- End diff --

Also, I get the motivation for using json4s directly rather than going through DataFrame and DataFrame.toJSON in terms of reducing dependencies. However, I like the idea of using DataFrame, since it will be helpful when we add other types of metadata, such as info about each feature.
[GitHub] spark pull request: [SPARK-2309][MLlib] Multinomial Logistic Regre...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3833#issuecomment-72413396 [Test build #26491 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26491/consoleFull) for PR 3833 at commit [`4ce4d33`](https://github.com/apache/spark/commit/4ce4d33f6d8119f4b68d6e436a398e0f975d9b40). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/3715#discussion_r23909226

--- Diff: core/src/test/scala/org/apache/spark/api/python/PythonRDDSuite.scala ---
@@ -23,11 +23,21 @@ import org.scalatest.FunSuite
 class PythonRDDSuite extends FunSuite {
 
-  test("Writing large strings to the worker") {
-    val input: List[String] = List("a"*10)
-    val buffer = new DataOutputStream(new ByteArrayOutputStream)
-    PythonRDD.writeIteratorToStream(input.iterator, buffer)
-  }
+  test("Writing large strings to the worker") {
+    val input: List[String] = List("a"*10)
+    val buffer = new DataOutputStream(new ByteArrayOutputStream)
+    PythonRDD.writeIteratorToStream(input.iterator, buffer)
+  }
 
-}
+  test("Handle nulls gracefully") {
+    val buffer = new DataOutputStream(new ByteArrayOutputStream)
+    PythonRDD.writeIteratorToStream(List("a", null).iterator, buffer)
+    PythonRDD.writeIteratorToStream(List(null, "a").iterator, buffer)
+    PythonRDD.writeIteratorToStream(List("a".getBytes, null).iterator, buffer)
+    PythonRDD.writeIteratorToStream(List(null, "a".getBytes).iterator, buffer)
+    PythonRDD.writeIteratorToStream(List((null, null), ("a", null), (null, "b")).iterator, buffer)
--- End diff --

There is a test in Python to verify that (though it does not cover all the cases).
[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/4233#discussion_r23909199

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
       case None => score
     }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Create JSON metadata.
+    val metadata = LogisticRegressionModel.Metadata(
+      clazz = this.getClass.getName, version = Exportable.latestVersion)
+    val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
+    metadataRDD.toJSON.saveAsTextFile(path + "/metadata")
+    // Create Parquet data.
+    val data = LogisticRegressionModel.Data(weights, intercept, threshold)
+    val dataRDD: DataFrame = sc.parallelize(Seq(data))
+    dataRDD.saveAsParquetFile(path + "/data")
+  }
+}
+
+object LogisticRegressionModel extends Importable[LogisticRegressionModel] {
+
+  private case class Metadata(clazz: String, version: String)
+
+  private case class Data(weights: Vector, intercept: Double, threshold: Option[Double])
+
+  override def load(sc: SparkContext, path: String): LogisticRegressionModel = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Load JSON metadata.
+    val metadataRDD = sqlContext.jsonFile(path + "/metadata")
--- End diff --

That's not quite my question: I think the confusion is mixing "row" (line) in a text file vs. "row" (or record) in an RDD. How about we store the metadata in a single record in an RDD, but print that RDD as multi-line JSON to a single text file? It will be easier for humans to read and will be easy to load as a single record as well.
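The single-record-but-multi-line proposal above can be shown with plain JSON, independent of Spark. This is an illustrative sketch (using Python's standard `json` module, not the Spark code under review): one metadata record is written pretty-printed across several lines, yet still parses back as exactly one record.

```python
import json

# One logical metadata record, written as multi-line JSON for readability.
# The field names mirror the Metadata case class in the diff; the file-on-disk
# step is elided and only the serialization round trip is shown.
metadata = {"clazz": "LogisticRegressionModel", "version": "1.0"}

text = json.dumps(metadata, indent=2)    # pretty-printed, multi-line
assert len(text.splitlines()) > 1        # human-readable: spans several lines
assert json.loads(text) == metadata      # still loads as a single record
```

This is the crux of the comment: "one record per line" is a property of how the JSON is printed, not of how many records there are.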
[GitHub] spark pull request: [SPARK-2309][MLlib] Multinomial Logistic Regre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/3833#discussion_r23909157

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
@@ -61,20 +79,58 @@ class LogisticRegressionModel (
   override protected def predictPoint(dataMatrix: Vector, weightMatrix: Vector,
       intercept: Double) = {
-    val margin = weightMatrix.toBreeze.dot(dataMatrix.toBreeze) + intercept
-    val score = 1.0 / (1.0 + math.exp(-margin))
-    threshold match {
-      case Some(t) => if (score > t) 1.0 else 0.0
-      case None => score
+    // If dataMatrix and weightMatrix have the same dimension, it's binary logistic regression.
+    if (dataMatrix.size == weightMatrix.size) {
+      val margin = dot(weights, dataMatrix) + intercept
+      val score = 1.0 / (1.0 + math.exp(-margin))
+      threshold match {
+        case Some(t) => if (score > t) 1.0 else 0.0
+        case None => score
+      }
+    } else {
+      val dataWithBiasSize = weightMatrix.size / (nClasses - 1)
+      val dataWithBias = if (dataWithBiasSize == dataMatrix.size) {
+        dataMatrix
+      } else {
+        assert(dataMatrix.size + 1 == dataWithBiasSize)
+        MLUtils.appendBias(dataMatrix)
--- End diff --

This can be done without creating the temp matrix w. See the updated PR.
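The binary branch of `predictPoint` in the diff above boils down to a small amount of arithmetic. The following is a minimal Python sketch of that logic only (the multinomial branch and the Spark `Vector` types are omitted; `predict_point` is an illustrative name, not a Spark API): dot product plus intercept gives the margin, the sigmoid of the margin gives the score, and an optional threshold turns the score into a 0/1 label.

```python
import math

# Sketch of the binary logistic-regression prediction: sigmoid of the margin,
# with an optional decision threshold applied to the resulting score.
def predict_point(data, weights, intercept, threshold=None):
    margin = sum(w * x for w, x in zip(weights, data)) + intercept
    score = 1.0 / (1.0 + math.exp(-margin))
    if threshold is None:
        return score                        # raw probability
    return 1.0 if score > threshold else 0.0

# margin = 2*1 + 2*(-1) + 0 = 0, so score = 0.5, which is not > 0.5 -> label 0.0
assert predict_point([1.0, -1.0], [2.0, 2.0], 0.0, threshold=0.5) == 0.0
assert abs(predict_point([1.0, -1.0], [2.0, 2.0], 0.0) - 0.5) < 1e-12
```

Note the strict `>` in the threshold comparison, matching the `score > t` in the Scala code: a score exactly at the threshold is classified as 0.0.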
[GitHub] spark pull request: [SPARK-5208][DOC] Add more documentation to Ne...
Github user sarutak closed the pull request at: https://github.com/apache/spark/pull/4012
[GitHub] spark pull request: [SPARK-5208][DOC] Add more documentation to Ne...
Github user sarutak commented on the pull request: https://github.com/apache/spark/pull/4012#issuecomment-72411859 OK, I'll close.
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3976#issuecomment-72410777 [Test build #26490 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26490/consoleFull) for PR 3976 at commit [`67f8cee`](https://github.com/apache/spark/commit/67f8cee9e25b5bd05c0252705b1f67cb63b0fa01). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5498][SQL]fix bug when query the data w...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4289#issuecomment-72410655 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26489/
[GitHub] spark pull request: [SPARK-5498][SQL]fix bug when query the data w...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4289#issuecomment-72410648 [Test build #26489 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26489/consoleFull) for PR 4289 at commit [`afc7da5`](https://github.com/apache/spark/commit/afc7da53be4b7bcb9cd5ce8d72b6855544b96596).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class Rating[@specialized(Int, Long) ID](user: ID, item: ID, rating: Float)`
  * `class StandardScalerModel (`
[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-72410436 @mengxr - Would the Attribute be per data point, or something that is set once per algorithm? The latter sounds like something the `ParamMap` should be able to handle. If it's per element, then it's like another column in the table? Sorry if I'm missing something, but it would be great if you could give an example.
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/3715#discussion_r23908186

--- Diff: core/src/test/scala/org/apache/spark/api/python/PythonRDDSuite.scala ---
@@ -23,11 +23,21 @@ import org.scalatest.FunSuite
 class PythonRDDSuite extends FunSuite {
 
-  test("Writing large strings to the worker") {
-    val input: List[String] = List("a"*10)
-    val buffer = new DataOutputStream(new ByteArrayOutputStream)
-    PythonRDD.writeIteratorToStream(input.iterator, buffer)
-  }
+  test("Writing large strings to the worker") {
+    val input: List[String] = List("a"*10)
+    val buffer = new DataOutputStream(new ByteArrayOutputStream)
+    PythonRDD.writeIteratorToStream(input.iterator, buffer)
+  }
 
-}
+  test("Handle nulls gracefully") {
+    val buffer = new DataOutputStream(new ByteArrayOutputStream)
+    PythonRDD.writeIteratorToStream(List("a", null).iterator, buffer)
+    PythonRDD.writeIteratorToStream(List(null, "a").iterator, buffer)
+    PythonRDD.writeIteratorToStream(List("a".getBytes, null).iterator, buffer)
+    PythonRDD.writeIteratorToStream(List(null, "a".getBytes).iterator, buffer)
+    PythonRDD.writeIteratorToStream(List((null, null), ("a", null), (null, "b")).iterator, buffer)
--- End diff --

This issue still has not been addressed. There are no asserts to check whether the nulls can be read back properly.
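The review point above is that writing nulls without reading them back proves little. The following is an illustrative round-trip sketch in Python (not the actual PythonRDD wire format; the framing and function names are assumptions for the example): strings are length-prefixed, null is encoded as a length of -1, and an assert confirms the nulls survive.

```python
import io
import struct

# Illustrative length-prefixed framing: each string is written as a big-endian
# 4-byte length followed by its UTF-8 bytes; None is encoded as length -1.
def write_items(items, out):
    for s in items:
        if s is None:
            out.write(struct.pack(">i", -1))
        else:
            data = s.encode("utf-8")
            out.write(struct.pack(">i", len(data)))
            out.write(data)

def read_items(inp):
    items = []
    while True:
        header = inp.read(4)
        if not header:
            return items
        (n,) = struct.unpack(">i", header)
        items.append(None if n == -1 else inp.read(n).decode("utf-8"))

# The assert is the point: the nulls are not just written, they are read back.
buf = io.BytesIO()
write_items(["a", None, "b"], buf)
buf.seek(0)
assert read_items(buf) == ["a", None, "b"]
```

A test shaped like this (write, rewind, read, assert) is what the comment asks the Scala suite to add on top of the existing write-only calls.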
[GitHub] spark pull request: [Spark-5406][MLlib] LocalLAPACK mode in RowMat...
Github user hhbyyh commented on the pull request: https://github.com/apache/spark/pull/4200#issuecomment-72410109 Thanks
[GitHub] spark pull request: Disabling Utils.chmod700 for Windows
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/4299#issuecomment-72409984 Hey @MartinWeindel any ideas why the diff for this PR is almost 2k lines? Is your IDE changing the line end characters somehow?
[GitHub] spark pull request: [WIP][SPARK-5501][SPARK-5420][SQL] Write suppo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4294#issuecomment-72409068

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26488/
[GitHub] spark pull request: [WIP][SPARK-5501][SPARK-5420][SQL] Write suppo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4294#issuecomment-72409064

[Test build #26488 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26488/consoleFull) for PR 4294 at commit [`9203ec2`](https://github.com/apache/spark/commit/9203ec2f5bfca2cdedb7b9042996db5d59edeb34).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class FPGrowthModel(val freqItemsets: RDD[(Array[String], Long)]) extends Serializable`
  * `class Node[T](val parent: Node[T]) extends Serializable`
  * `protected[sql] class DDLException(message: String) extends Exception(message)`
  * `trait TableScan extends BaseRelation`
  * `trait PrunedScan extends BaseRelation`
  * `trait PrunedFilteredScan extends BaseRelation`
  * `trait CatalystScan extends BaseRelation`
  * `trait InsertableRelation extends BaseRelation`
  * `case class CreateMetastoreDataSourceAsSelect(`
[GitHub] spark pull request: [SPARK-5353] Log failures in REPL class loadin...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4130
[GitHub] spark pull request: [SPARK-5012][MLLib][PySpark]Python API for Gau...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4059#issuecomment-72408935

Yes, Array should work.
[GitHub] spark pull request: [SPARK-5353] Log failures in REPL class loadin...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4130#issuecomment-72408916

I can merge it.
[GitHub] spark pull request: Add a config option to print DAG.
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4257#issuecomment-72408451

@rxin I have noticed that very few users know about `toDebugString`. Maybe we should open a JIRA to add better documentation for that function (i.e. discuss it in the programming guide). Overall, I agree with you and @ScrapCodes in that I'm not sure this particular flag is super useful.
[GitHub] spark pull request: [SPARK-5353] Log failures in REPL class loadin...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4130#issuecomment-72408322

LGTM
[GitHub] spark pull request: [SPARK-5208][DOC] Add more documentation to Ne...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4012#issuecomment-72408184

Okay @sarutak, can you close this issue then? Looks like we intentionally left these out for now.
[GitHub] spark pull request: [SPARK-5341] Use maven coordinates as dependen...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4215#discussion_r23907239

--- Diff: core/pom.xml ---

```
@@ -225,6 +225,16 @@
       <scope>test</scope>
     </dependency>
+    <dependency>
+      <groupId>org.apache.ivy</groupId>
+      <artifactId>ivy</artifactId>
+      <version>${ivy.version}</version>
+    </dependency>
+    <dependency>
+      <groupId>oro</groupId>
```

--- End diff --

@brkyvz add a comment here:

```
```
[GitHub] spark pull request: [SPARK-3996]: Shade Jetty in Spark deliverable...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4285
[GitHub] spark pull request: [SPARK-3996]: Shade Jetty in Spark deliverable...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4285#issuecomment-72407319

Okay - let's try this for take 2.
[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/3798#discussion_r23907051

--- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala ---

```
@@ -144,4 +150,249 @@ object KafkaUtils {
     createStream[K, V, U, T](
       jssc.ssc, kafkaParams.toMap, Map(topics.mapValues(_.intValue()).toSeq: _*), storageLevel)
   }
+
+  /** A batch-oriented interface for consuming from Kafka.
+   * Starting and ending offsets are specified in advance,
+   * so that you can control exactly-once semantics.
+   * @param sc SparkContext object
+   * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+   * configuration parameters</a>.
+   * Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+   * NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+   * @param batch Each OffsetRange in the batch corresponds to a
+   * range of offsets for a given Kafka topic/partition
+   */
+  @Experimental
+  def createRDD[
+    K: ClassTag,
+    V: ClassTag,
+    U <: Decoder[_]: ClassTag,
+    T <: Decoder[_]: ClassTag,
+    R: ClassTag] (
```

--- End diff --

Good catch!
[GitHub] spark pull request: [SPARK-5498][SQL]fix bug when query the data w...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4289#issuecomment-72406797

[Test build #26489 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26489/consoleFull) for PR 4289 at commit [`afc7da5`](https://github.com/apache/spark/commit/afc7da53be4b7bcb9cd5ce8d72b6855544b96596).

* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5012][MLLib][PySpark]Python API for Gau...
Github user FlytxtRnD commented on the pull request: https://github.com/apache/spark/pull/4059#issuecomment-72406403

So I will go with the current approach. I tried to change Array to ArrayBuffer but it ends up in exceptions. So can I go with Array itself?
[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/3798#discussion_r23906731

--- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaRDD.scala ---

```
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.kafka
+
+import scala.reflect.{classTag, ClassTag}
+
+import org.apache.spark.{Logging, Partition, SparkContext, SparkException, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.NextIterator
+
+import java.util.Properties
+import kafka.api.{FetchRequestBuilder, FetchResponse}
+import kafka.common.{ErrorMapping, TopicAndPartition}
+import kafka.consumer.{ConsumerConfig, SimpleConsumer}
+import kafka.message.{MessageAndMetadata, MessageAndOffset}
+import kafka.serializer.Decoder
+import kafka.utils.VerifiableProperties
+
+/**
+ * A batch-oriented interface for consuming from Kafka.
+ * Starting and ending offsets are specified in advance,
+ * so that you can control exactly-once semantics.
+ * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+ * configuration parameters</a>.
+ * Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+ * NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+ * @param batch Each KafkaRDDPartition in the batch corresponds to a
+ * range of offsets for a given Kafka topic/partition
+ * @param messageHandler function for translating each message into the desired type
+ */
+private[spark]
+class KafkaRDD[
+  K: ClassTag,
+  V: ClassTag,
+  U <: Decoder[_]: ClassTag,
+  T <: Decoder[_]: ClassTag,
+  R: ClassTag] private[spark] (
+    sc: SparkContext,
+    kafkaParams: Map[String, String],
+    private[spark] val batch: Array[KafkaRDDPartition],
```

--- End diff --

Actually, this is not the desired way to create RDDs. The partition objects are generated by the RDD itself, not provided from outside. Although this is not a written hard rule, it is generally the norm followed by all types of RDDs. Example: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala#L65
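The convention tdas describes, where an RDD takes a description of the work as input and derives its own Partition objects in `getPartitions` rather than accepting pre-built partitions, can be sketched without Spark. Everything below (`SketchRDD`, `OffsetRange`, `KafkaLikeRDD`) is invented for illustration and is not Spark's actual API:

```scala
// Describes a slice of work, analogous to a Kafka topic/partition offset range.
case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

trait Partition { def index: Int }

// Minimal stand-in for Spark's RDD contract: each RDD builds its own partitions.
abstract class SketchRDD[T] {
  protected def getPartitions: Array[Partition]
  final lazy val partitions: Array[Partition] = getPartitions
}

// The constructor takes the *description* (offset ranges); the Partition
// objects are derived inside getPartitions, one per range, following the
// norm tdas points to rather than being passed in by the caller.
class KafkaLikeRDD(offsetRanges: Seq[OffsetRange]) extends SketchRDD[String] {
  private case class KafkaLikePartition(index: Int, range: OffsetRange) extends Partition

  override protected def getPartitions: Array[Partition] =
    offsetRanges.zipWithIndex
      .map { case (r, i) => KafkaLikePartition(i, r) }
      .toArray[Partition]
}
```

The design point is that callers never see or construct partition objects; they describe the input (here, offset ranges) and the RDD owns the mapping from description to partitions, as UnionRDD does in the linked example.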
[GitHub] spark pull request: [SQL] Improve DataFrame API error reporting
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4296#issuecomment-72405811

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26487/
[GitHub] spark pull request: [SPARK-5012][MLLib][PySpark]Python API for Gau...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4059#issuecomment-72405813

They are not attributes but public methods. Did you try `mu()` and `sigma()`? I think the current approach looks good except for the minor issues commented on. We can try other approaches in a later PR.
[GitHub] spark pull request: [SQL] Improve DataFrame API error reporting
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4296#issuecomment-72405807

[Test build #26487 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26487/consoleFull) for PR 4296 at commit [`17f6bae`](https://github.com/apache/spark/commit/17f6bae783362076c977aae834792dc94cffca94).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `trait Column extends DataFrame with ExpressionApi`
  * `class ColumnName(name: String) extends IncomputableColumn(name)`
  * `trait DataFrame extends DataFrameSpecificApi with RDDApi[Row]`
  * `class GroupedDataFrame protected[sql](df: DataFrameImpl, groupingExprs: Seq[Expression])`
  * `protected[sql] class QueryExecution(val logical: LogicalPlan)`
[GitHub] spark pull request: [SPARK-5470][Core]use defaultClassLoader to lo...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/4258#issuecomment-72405667

LGTM
[GitHub] spark pull request: PCA wrapper for easy transform vectors
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4304#issuecomment-72405573

@catap This is nice to have. Could you follow the steps in https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark for contributing to Spark? For example, you need to create a JIRA (and get assigned) and put the JIRA number in the PR title. For the public APIs, please follow other transformers under `mllib.feature`.
[GitHub] spark pull request: [WIP][SPARK-5501][SQL] Write support for the d...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4294#issuecomment-72405506

[Test build #26488 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26488/consoleFull) for PR 4294 at commit [`9203ec2`](https://github.com/apache/spark/commit/9203ec2f5bfca2cdedb7b9042996db5d59edeb34).

* This patch merges cleanly.
[GitHub] spark pull request: Disabling Utils.chmod700 for Windows
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4299#issuecomment-72405340

[Test build #26483 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26483/consoleFull) for PR 4299 at commit [`fe2740b`](https://github.com/apache/spark/commit/fe2740bef2320195a64fbaa7f29d6493cc6337c8).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: Disabling Utils.chmod700 for Windows
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4299#issuecomment-72405341

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26483/
[GitHub] spark pull request: [SQL] Improve DataFrame API error reporting
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4296#issuecomment-72405245

[Test build #26487 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26487/consoleFull) for PR 4296 at commit [`17f6bae`](https://github.com/apache/spark/commit/17f6bae783362076c977aae834792dc94cffca94).

* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-72405143

About the metadata, I'm thinking of creating ML Attribute/VectorAttribute classes that store feature information and can be loaded from/saved to Spark SQL's metadata. It is similar to Weka's Attribute implementation. Since `RDD[LabeledPoint]` doesn't carry this extra information, could we make ML attributes an input argument to the `train` method? For example:

~~~
def train(dataset: RDD[LabeledPoint], attributes: (Attribute, VectorAttribute))
~~~
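To make the proposal above concrete, here is a purely hypothetical sketch of what such Attribute/VectorAttribute classes might look like. Every name and field is invented for illustration and is not Spark ML's actual attribute API; a plain `Seq` stands in for `RDD[LabeledPoint]`:

```scala
// Hypothetical attribute metadata, loosely in the spirit of Weka's Attribute.
sealed trait AttrType
case object Numeric extends AttrType
case object Nominal extends AttrType

case class Attribute(name: String, attrType: AttrType)

// Per-element metadata for a feature vector.
case class VectorAttribute(elements: IndexedSeq[Attribute]) {
  def numFeatures: Int = elements.length
}

case class LabeledPoint(label: Double, features: Array[Double])

object AttrSketch {
  // A train method that carries the metadata alongside the data, matching the
  // proposed (label attribute, feature attributes) tuple argument. Here it only
  // validates the data against the metadata and returns the feature count it
  // checked, as a stand-in for returning a trained model.
  def train(dataset: Seq[LabeledPoint], attributes: (Attribute, VectorAttribute)): Int = {
    val (_, featureAttrs) = attributes
    require(dataset.forall(_.features.length == featureAttrs.numFeatures),
      s"each point must have ${featureAttrs.numFeatures} features")
    featureAttrs.numFeatures
  }
}
```

The point of the signature is that the strongly typed API can consume the same metadata the weakly typed DataFrame API would carry in column metadata, without duplicating it in Params.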
[GitHub] spark pull request: PCA wrapper for easy transform vectors
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4304#issuecomment-72404997

Can one of the admins verify this patch?
[GitHub] spark pull request: PCA wrapper for easy transform vectors
GitHub user catap opened a pull request: https://github.com/apache/spark/pull/4304

PCA wrapper for easy transform vectors

I implemented a simple PCA wrapper for easily transforming vectors with PCA, for example the features of a LabeledPoint or another more complicated structure. Example of usage:

```
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.feature.PCA

val data = sc.textFile("data/mllib/ridge-data/lpsa.data").map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

val pca = PCA.create(training.first().features.size / 2, data.map(_.features))
val training_pca = training.map(p => p.copy(features = pca.transform(p.features)))
val test_pca = test.map(p => p.copy(features = pca.transform(p.features)))

val numIterations = 100
val model = LinearRegressionWithSGD.train(training, numIterations)
val model_pca = LinearRegressionWithSGD.train(training_pca, numIterations)

val valuesAndPreds = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
val valuesAndPreds_pca = test_pca.map { point =>
  val score = model_pca.predict(point.features)
  (score, point.label)
}

val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
val MSE_pca = valuesAndPreds_pca.map { case (v, p) => math.pow((v - p), 2) }.mean()

println("Mean Squared Error = " + MSE)
println("PCA Mean Squared Error = " + MSE_pca)
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/catap/spark pca

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4304.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #4304

commit c71af4ad718be60e231bb10e39211f1acb1b04ab
Author: Kirill A. Korinskiy
Date: 2015-02-02T04:24:52Z

    PCA wrapper for easy transform vectors
[GitHub] spark pull request: [SPARK-5012][MLLib][PySpark]Python API for Gau...
Github user FlytxtRnD commented on the pull request: https://github.com/apache/spark/pull/4059#issuecomment-72404786

Instead of passing mu & sigma as arrays, I tried to directly pass "gaussians" (Array[MultivariateGaussian]) from PythonMLLibAPI. But I was not able to access the attributes of the MultivariateGaussian class object in Python, so I converted "gaussians" into two arrays of mu and sigma and returned those to Python. Which method is better? And is it possible to access the attributes mu & sigma in Python by passing "gaussians" directly?
[GitHub] spark pull request: [SPARK-4001][MLlib] adding parallel FP-Growth ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/2847
[GitHub] spark pull request: [SPARK-4001][MLlib] adding parallel FP-Growth ...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2847#issuecomment-72404067

LGTM. Merged into master. Thanks!! (The failed test is a known flaky test. All relevant tests passed.)
[GitHub] spark pull request: [SPARK-5324][SQL] Results of describe can't be...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4249#issuecomment-72404020

[Test build #26485 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26485/consoleFull) for PR 4249 at commit [`11559ae`](https://github.com/apache/spark/commit/11559ae5b8356e0b50b1647af1623b04ca42523a).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5324][SQL] Results of describe can't be...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4249#issuecomment-72404022 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26485/
[GitHub] spark pull request: [SPARK-4001][MLlib] adding parallel FP-Growth ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2847#issuecomment-72403864

[Test build #26486 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26486/consoleFull) for PR 2847 at commit [`bee3093`](https://github.com/apache/spark/commit/bee3093daa4c8473a9f531c5fdee353c06cd1bf0).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class FPGrowthModel(val freqItemsets: RDD[(Array[String], Long)]) extends Serializable`
  * `class Node[T](val parent: Node[T]) extends Serializable`
[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4233#discussion_r23906177

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
~~~
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
       case None => score
     }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Create JSON metadata.
+    val metadata = LogisticRegressionModel.Metadata(
+      clazz = this.getClass.getName, version = Exportable.latestVersion)
+    val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
+    metadataRDD.toJSON.saveAsTextFile(path + "/metadata")
+    // Create Parquet data.
+    val data = LogisticRegressionModel.Data(weights, intercept, threshold)
+    val dataRDD: DataFrame = sc.parallelize(Seq(data))
+    dataRDD.saveAsParquetFile(path + "/data")
+  }
+}
+
+object LogisticRegressionModel extends Importable[LogisticRegressionModel] {
+
+  private case class Metadata(clazz: String, version: String)
+
+  private case class Data(weights: Vector, intercept: Double, threshold: Option[Double])
+
+  override def load(sc: SparkContext, path: String): LogisticRegressionModel = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Load JSON metadata.
+    val metadataRDD = sqlContext.jsonFile(path + "/metadata")
~~~
--- End diff --

We want to use RDD to avoid talking to fs directly. If you use json4s, you can render single-line JSON easily:

~~~
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

val json = ("a\n" -> "b\n")
println(compact(render(json)))
~~~

outputs

~~~
{"a\n":"b\n"}
~~~

So the metadata won't span multiple lines.
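The single-line property mengxr relies on above holds the same way in plain Python, shown here for comparison (a stdlib-only sketch, not part of the PR): `json.dumps` escapes embedded newlines, so each metadata record occupies exactly one physical line of the saved text file.

```python
import json

# Keys/values containing raw newlines, mirroring the json4s example above.
metadata = {"a\n": "b\n"}
line = json.dumps(metadata)  # newlines are escaped to \n, so 'line' has no breaks

# Round-trip check: parsing the single line recovers the original mapping.
restored = json.loads(line)
```

Because every serialized record is one line, a line-oriented reader (such as an RDD of text lines) can parse records independently without scanning for record boundaries.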
[GitHub] spark pull request: [SPARK-4001][MLlib] adding parallel FP-Growth ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2847#issuecomment-72403870 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26486/
[GitHub] spark pull request: [SPARK-4943][SPARK-5251][SQL] Allow table name...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/4062#issuecomment-72403143 ping
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3976#issuecomment-72403008

[Test build #26482 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26482/consoleFull) for PR 3976 at commit [`0319ae3`](https://github.com/apache/spark/commit/0319ae328b2db694684ea586cbb7d49fb2b487c7).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3976#issuecomment-72403015 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26482/
[GitHub] spark pull request: [Spark-5406][MLlib] LocalLAPACK mode in RowMat...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4200#issuecomment-72402760 The changes look good to me. We may want to investigate more on the limits, but the current setting is certainly better than master. I've merged it. Thanks for testing!
[GitHub] spark pull request: [Spark-5406][MLlib] LocalLAPACK mode in RowMat...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4200
[GitHub] spark pull request: [SPARK-5212][SQL] Add support of schema-less, ...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/4014#issuecomment-72402698 Thanks for working on this! It would be great if this could be updated soon so we can include it in 1.3.
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3976#issuecomment-72402028 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26480/
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3976#issuecomment-72402026

[Test build #26480 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26480/consoleFull) for PR 3976 at commit [`2385ef6`](https://github.com/apache/spark/commit/2385ef679638fcb0b544a3de7744c9f4f2c242f0).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SQL] Improve DataFrame API error reporting
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/4296#discussion_r23905446

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
~~~
@@ -28,6 +28,21 @@ import scala.language.postfixOps
 class DataFrameSuite extends QueryTest {
   import org.apache.spark.sql.TestData._

+  test("analysis error should be eagerly reported") {
+    intercept[Exception] { testData.select('nonExistentName) }
+    intercept[Exception] {
+      testData.groupBy('key).agg(Map("nonExistentName" -> "sum"))
+    }
+    intercept[Exception] {
+      testData.groupBy("nonExistentName").agg(Map("key" -> "sum"))
~~~
--- End diff --

Why isn't this `(String, String)*`?
[GitHub] spark pull request: [SPARK-5278][SQL] complete the check of ambigu...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/4068#discussion_r23905429

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
~~~
@@ -285,11 +285,22 @@ class Analyzer(catalog: Catalog,
       result

       // Resolve field names using the resolver.
-      case f @ GetField(child, fieldName) if !f.resolved && child.resolved =>
+      case f @ GetField(child, fieldName) if child.resolved =>
         child.dataType match {
           case StructType(fields) =>
-            val resolvedFieldName = fields.map(_.name).find(resolver(_, fieldName))
-            resolvedFieldName.map(n => f.copy(fieldName = n)).getOrElse(f)
+            val actualField = fields.filter(f => resolver(f.name, fieldName))
+            if (actualField.length == 0) {
+              sys.error(
+                s"No such struct field $fieldName in ${fields.map(_.name).mkString(", ")}")
~~~
--- End diff --

ping @marmbrus
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user lianhuiwang commented on the pull request: https://github.com/apache/spark/pull/3976#issuecomment-72401201 For a Python application, if the SPARK_HOME of the submission node is different from that of the NodeManager, it does not work in my test. Example: the submission node's version is 1.2, but Spark's version on the NodeManager is 1.1; that combination does not work now. I think this is a separate problem that does not belong to this PR, because it also exists in yarn-client mode.
[GitHub] spark pull request: [SPARK-4001][MLlib] adding parallel FP-Growth ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2847#issuecomment-72401147 [Test build #26486 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26486/consoleFull) for PR 2847 at commit [`bee3093`](https://github.com/apache/spark/commit/bee3093daa4c8473a9f531c5fdee353c06cd1bf0). * This patch merges cleanly.
[GitHub] spark pull request: [Spark-5406][MLlib] LocalLAPACK mode in RowMat...
Github user hhbyyh commented on the pull request: https://github.com/apache/spark/pull/4200#issuecomment-72400944 @mengxr Sorry to disturb. I know you are probably quite busy with many PRs in review. Can you please provide some comments if you get a minute? I will close the PR if it's regarded as unnecessary for now~ Thanks.
[GitHub] spark pull request: [SPARK-5498][SQL]fix bug when query the data w...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4289#issuecomment-72400702 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26484/
[GitHub] spark pull request: [SPARK-5498][SQL]fix bug when query the data w...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4289#issuecomment-72400699

[Test build #26484 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26484/consoleFull) for PR 4289 at commit [`b1527d5`](https://github.com/apache/spark/commit/b1527d58349ccdc0b986705b93d7658822211994).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5324][SQL] Results of describe can't be...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4249#issuecomment-72400616 [Test build #26485 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26485/consoleFull) for PR 4249 at commit [`11559ae`](https://github.com/apache/spark/commit/11559ae5b8356e0b50b1647af1623b04ca42523a). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5404] [SQL] update the default statisti...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/4199#issuecomment-72400547 I don't think that I agree with this change. In general it is always safe to do a shuffle join, whereas a broadcast join could possibly cause the driver to OOM. I'm worried that this change will make us faster for some workloads but possibly also unstable.
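The trade-off in this comment can be sketched as the size-based rule the planner applies (a hedged illustration, not Spark's actual planner code; the function name is made up, though the real knob is the `spark.sql.autoBroadcastJoinThreshold` setting): a table whose estimated size is under the threshold is broadcast, while anything larger, or of unknown size, falls back to the always-safe shuffle join.

```python
# Spark's long-standing default broadcast threshold is 10 MB.
DEFAULT_AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024

def choose_join_strategy(estimated_size_bytes, threshold=DEFAULT_AUTO_BROADCAST_THRESHOLD):
    """Pick a join strategy from a (possibly missing) table-size estimate.

    Broadcasting a table whose size was under-estimated risks driver OOM,
    which is why a pessimistic (large) default statistic is the safe choice
    the comment above is defending.
    """
    if estimated_size_bytes is None:  # no statistics available: stay safe
        return "shuffle"
    if estimated_size_bytes <= threshold:
        return "broadcast"
    return "shuffle"
```

Lowering the default statistic makes more tables look broadcastable, which speeds up joins that really are small but destabilizes the driver whenever the estimate is wrong.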
[GitHub] spark pull request: [SPARK-5465] [SQL] Fixes filter push-down for ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4255
[GitHub] spark pull request: [SPARK-5498][SQL]fix bug when query the data w...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4289#issuecomment-72400374 [Test build #26484 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26484/consoleFull) for PR 4289 at commit [`b1527d5`](https://github.com/apache/spark/commit/b1527d58349ccdc0b986705b93d7658822211994). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5324][SQL] Results of describe can't be...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/4249#issuecomment-72400392 ok to test
[GitHub] spark pull request: [SPARK-5262] [SPARK-5244] [SQL] add coalesce i...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4057
[GitHub] spark pull request: [SPARK-5262] [SPARK-5244] [SQL] add coalesce i...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/4057#issuecomment-72400328 Thanks! Merged to master.
[GitHub] spark pull request: [SPARK-5515] Build fails with spark-ganglia-lg...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4303#issuecomment-72400202 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26478/
[GitHub] spark pull request: [SPARK-5515] Build fails with spark-ganglia-lg...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4303#issuecomment-72400197

[Test build #26478 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26478/consoleFull) for PR 4303 at commit [`5cf455f`](https://github.com/apache/spark/commit/5cf455f08eae005d48b8420d7aeec30520bd30df).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class Rating[@specialized(Int, Long) ID](user: ID, item: ID, rating: Float)`
  * `class StandardScalerModel (`
[GitHub] spark pull request: Disabling Utils.chmod700 for Windows
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4299#issuecomment-72400107 [Test build #26483 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26483/consoleFull) for PR 4299 at commit [`fe2740b`](https://github.com/apache/spark/commit/fe2740bef2320195a64fbaa7f29d6493cc6337c8). * This patch merges cleanly.
[GitHub] spark pull request: Disabling Utils.chmod700 for Windows
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/4299#issuecomment-72400044 ok to test
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user lianhuiwang commented on a diff in the pull request: https://github.com/apache/spark/pull/3976#discussion_r23905022

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
~~~
@@ -134,12 +136,29 @@ object SparkSubmit {
       }
     }

+    val isYarnCluster = clusterManager == YARN && deployMode == CLUSTER
+
+    // Require all python files to be local, so we can add them to the PYTHONPATH
+    // when yarn-cluster, all python files can be non-local
+    if (args.isPython && !isYarnCluster) {
+      if (Utils.nonLocalPaths(args.primaryResource).nonEmpty) {
+        SparkSubmit.printErrorAndExit(
~~~
--- End diff --

If we move it to SparkSubmitArguments, we need to get clusterManager and deployMode. But this work has already been done in SparkSubmit, so there would be some repeated work.
[GitHub] spark pull request: [SPARK-5196][SQL] Support `comment` in Create ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3999
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/3976#issuecomment-72399690 @lianhuiwang what happens now if the submission node uses a different SPARK_HOME from the machines? Does it still work?
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3976#discussion_r23904936

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala ---
~~~
@@ -185,6 +192,7 @@ private[spark] class ClientArguments(args: Array[String], sparkConf: SparkConf)
       | --jar JAR_PATH           Path to your application's JAR file (required in yarn-cluster
       |                          mode)
       | --class CLASS_NAME       Name of your application's main class (required)
+      | --primary-py-file        A primary Python file
~~~
--- End diff --

same here
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3976#discussion_r23904930

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterArguments.scala ---
@@ -81,6 +91,9 @@ class ApplicationMasterArguments(val args: Array[String]) {
     |Options:
     |  --jar JAR_PATH       Path to your application's JAR file
     |  --class CLASS_NAME   Name of your application's main class
+    |  --primary-py-file    A primary Python file
--- End diff --

The main python file
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3976#discussion_r23904922

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -134,12 +136,29 @@ object SparkSubmit {
     }
   }

+    val isYarnCluster = clusterManager == YARN && deployMode == CLUSTER
+
+    // Require all python files to be local, so we can add them to the PYTHONPATH
+    // when yarn-cluster, all python files can be non-local
+    if (args.isPython && !isYarnCluster) {
+      if (Utils.nonLocalPaths(args.primaryResource).nonEmpty) {
+        SparkSubmit.printErrorAndExit(
--- End diff --

Also, not a big deal but I actually think this check belongs better in `SparkSubmitArguments`.
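The check under discussion rejects non-local Python resources unless the app runs in yarn-cluster mode, where YARN distributes the files itself. A minimal Python sketch of that logic, for illustration only (Spark's actual `Utils.nonLocalPaths` is Scala, and the helper names here are hypothetical):

```python
from urllib.parse import urlparse

def non_local_paths(paths):
    """Return the subset of comma-separated paths whose URI scheme is not local.
    Illustrative stand-in for Spark's Utils.nonLocalPaths."""
    local_schemes = ("", "file", "local")
    return [p for p in paths.split(",") if urlparse(p).scheme not in local_schemes]

def check_python_resources(primary_resource, is_python, is_yarn_cluster):
    # Require all python files to be local so they can go on the PYTHONPATH;
    # yarn-cluster distributes files itself, so non-local paths are allowed there.
    if is_python and not is_yarn_cluster and non_local_paths(primary_resource):
        raise SystemExit("Only local python files are supported: " + primary_resource)
```

For example, `non_local_paths("app.py,hdfs://nn/app2.py")` flags only the HDFS path, so a client-mode submission with that primary resource would be rejected.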
[GitHub] spark pull request: [SPARK-5173]support python application running...
Github user lianhuiwang commented on a diff in the pull request: https://github.com/apache/spark/pull/3976#discussion_r23904929

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -430,6 +430,10 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments,
   private def startUserClass(): Thread = {
     logInfo("Starting the user JAR in a separate Thread")
--- End diff --

ok, i got it. i will update it. thanks.
[GitHub] spark pull request: [SPARK-5388] Provide a stable application subm...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4216#issuecomment-72399572

[Test build #26477 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26477/consoleFull) for PR 4216 at commit [`42e5de4`](https://github.com/apache/spark/commit/42e5de43c26806fb36aced9bf70e23e2eadbac41).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class MasterStateResponse(`
  * `class LocalSparkCluster(`
  * `* (4) the main class for the child`
  * `case class BoundPortsResponse(actorPort: Int, webUIPort: Int, restPort: Option[Int])`
  * `class DriverStatusRequest extends SubmitRestProtocolRequest`
  * `class DriverStatusResponse extends SubmitRestProtocolResponse`
  * `class ErrorResponse extends SubmitRestProtocolResponse`
  * `class KillDriverRequest extends SubmitRestProtocolRequest`
  * `class KillDriverResponse extends SubmitRestProtocolResponse`
  * `throw new SubmitRestMissingFieldException("Main class must be set in submit request.")`
  * `class SubmitDriverRequest extends SubmitRestProtocolRequest`
  * `class SubmitDriverResponse extends SubmitRestProtocolResponse`
  * `class SubmitRestProtocolException(message: String, cause: Exception = null)`
  * `class SubmitRestMissingFieldException(message: String) extends SubmitRestProtocolException(message)`
  * `abstract class SubmitRestProtocolMessage`
  * `abstract class SubmitRestProtocolRequest extends SubmitRestProtocolMessage`
  * `abstract class SubmitRestProtocolResponse extends SubmitRestProtocolMessage`
[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/3798#discussion_r23904918

--- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala ---
@@ -144,4 +150,249 @@ object KafkaUtils {
     createStream[K, V, U, T](
       jssc.ssc, kafkaParams.toMap, Map(topics.mapValues(_.intValue()).toSeq: _*), storageLevel)
   }
+
+  /**
+   * A batch-oriented interface for consuming from Kafka.
+   * Starting and ending offsets are specified in advance,
+   * so that you can control exactly-once semantics.
+   * @param sc SparkContext object
+   * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+   *   configuration parameters</a>.
+   *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+   *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+   * @param batch Each OffsetRange in the batch corresponds to a
+   *   range of offsets for a given Kafka topic/partition
+   */
+  @Experimental
+  def createRDD[
+    K: ClassTag,
+    V: ClassTag,
+    U <: Decoder[_]: ClassTag,
+    T <: Decoder[_]: ClassTag,
+    R: ClassTag] (
+      sc: SparkContext,
+      kafkaParams: Map[String, String],
+      batch: Array[OffsetRange]
+  ): RDD[(K, V)] with HasOffsetRanges = {
+    val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
+    val kc = new KafkaCluster(kafkaParams)
+    val topics = batch.map(o => TopicAndPartition(o.topic, o.partition)).toSet
+    val leaderMap = kc.findLeaders(topics).fold(
+      errs => throw new SparkException(errs.mkString("\n")),
+      ok => ok
+    )
+    val rddParts = batch.zipWithIndex.map { case (o, i) =>
+      val tp = TopicAndPartition(o.topic, o.partition)
+      val (host, port) = leaderMap(tp)
+      new KafkaRDDPartition(i, o.topic, o.partition, o.fromOffset, o.untilOffset, host, port)
+    }.toArray
+    new KafkaRDD[K, V, U, T, (K, V)](sc, kafkaParams, rddParts, messageHandler)
+  }
+
+  /**
+   * A batch-oriented interface for consuming from Kafka.
+   * Starting and ending offsets are specified in advance,
+   * so that you can control exactly-once semantics.
+   * @param sc SparkContext object
+   * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+   *   configuration parameters</a>.
+   *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+   *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+   * @param batch Each OffsetRange in the batch corresponds to a
+   *   range of offsets for a given Kafka topic/partition
+   * @param leaders Kafka leaders for each offset range in batch
+   * @param messageHandler function for translating each message into the desired type
+   */
+  @Experimental
+  def createRDD[
+    K: ClassTag,
+    V: ClassTag,
+    U <: Decoder[_]: ClassTag,
+    T <: Decoder[_]: ClassTag,
+    R: ClassTag] (
+      sc: SparkContext,
+      kafkaParams: Map[String, String],
+      batch: Array[OffsetRange],
+      leaders: Array[Leader],
+      messageHandler: MessageAndMetadata[K, V] => R
+  ): RDD[R] with HasOffsetRanges = {
+    val leaderMap = leaders.map(l => (l.topic, l.partition) -> (l.host, l.port)).toMap
+    val rddParts = batch.zipWithIndex.map { case (o, i) =>
+      val (host, port) = leaderMap((o.topic, o.partition))
+      new KafkaRDDPartition(i, o.topic, o.partition, o.fromOffset, o.untilOffset, host, port)
+    }.toArray
+
+    new KafkaRDD[K, V, U, T, R](sc, kafkaParams, rddParts, messageHandler)
+  }
+
+  /**
+   * This stream can guarantee that each message from Kafka is included in transformations
+   * (as opposed to output actions) exactly once, even in most failure situations.
+   *
+   * Points to note:
+   *
+   * Failure Recovery - You must checkpoint this stream, or save offsets yourself and provide them
+   * as the fromOffsets parameter on restart.
+   * Kafka must have sufficient log retention to obtain messages after failure.
+   *
+   * Getting offsets from the stream - see programming guide
+   *
+   * Zookeeper - This does not use Zookeeper to store offsets. For interop with Kafka monitors
+   * that depend on Zookeeper, you must store offsets in ZK yourself.
+   *
+   * End-to-end semantics - This does not guarantee that any output operation will push each record
+   * exactly once. To ensure end-to-end exactly-once semantics (that is, receiving exactly once and
+   * outputting exactly once), you have to either ensure that the output operation is
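The second `createRDD` overload above takes the leaders from the caller and builds one RDD partition per offset range by zipping the ranges with their index and looking up each range's leader. A minimal Python sketch of just that zip-and-lookup step, with plain tuples standing in for the Scala case classes (`OffsetRange`, `Leader`, `KafkaRDDPartition`); the names here are illustrative, not Spark API:

```python
def build_partitions(ranges, leaders):
    """ranges:  list of (topic, partition, from_offset, until_offset)
    leaders: list of (topic, partition, host, port)
    Returns one partition descriptor per offset range, preserving range order,
    mirroring the zipWithIndex/leaderMap step in the createRDD overload above."""
    # Index leaders by (topic, partition), like leaders.map(...).toMap
    leader_map = {(t, p): (host, port) for (t, p, host, port) in leaders}
    parts = []
    for i, (topic, part, from_off, until_off) in enumerate(ranges):
        # KeyError here corresponds to a missing leader for a range
        host, port = leader_map[(topic, part)]
        parts.append((i, topic, part, from_off, until_off, host, port))
    return parts
```

The point of the design is that the caller controls exactly which offsets each partition covers, which is what makes the batch replayable and hence compatible with exactly-once processing.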
[GitHub] spark pull request: [SPARK-5388] Provide a stable application subm...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4216#issuecomment-72399576 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26477/ Test PASSed.