[GitHub] spark pull request: [SPARK-6223][SQL] Fix build warning- enable im...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4948#issuecomment-77841961 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6223][SQL] Fix build warning- enable im...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4948#issuecomment-77855093 As I mentioned, I don't think it's efficient to try to make changes like this one line at a time. There are a number of warnings like this, and other build warnings in general, that we can resolve as one logical change. I have already created a set of fixes for this and will open a PR. https://github.com/apache/spark/pull/4908 https://github.com/apache/spark/pull/4900
[GitHub] spark pull request: SPARK-6225 [CORE] [SQL] [STREAMING] Resolve mo...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/4950#discussion_r26036340 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala --- @@ -199,12 +199,12 @@ object MatrixFactorizationModel extends Loader[MatrixFactorizationModel] { assert(formatVersion == thisFormatVersion) val rank = (metadata \ "rank").extract[Int] val userFeatures = sqlContext.parquetFile(userPath(path)) -.map { case Row(id: Int, features: Seq[Double]) => - (id, features.toArray) +.map { case Row(id: Int, features: Seq[_]) => + (id, features.asInstanceOf[Seq[Double]].toArray) --- End diff -- Strangely, this is how the Scala compiler wanted it. It doesn't like matching on a type with generics, since they are erased.
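The erasure point above can be shown in a standalone sketch (the `extract` helper below is hypothetical, not Spark code): because a `Seq`'s element type is erased at runtime, a pattern can only check that the value is *some* `Seq`, and the specific element type has to be asserted with a cast afterwards.

```scala
// Hypothetical helper, not Spark's code: the element type of a Seq is erased
// at runtime, so the pattern can only match Seq[_]. Matching on Seq[Double]
// directly would draw an unchecked warning; the cast afterwards is the
// caller's promise, which the compiler cannot verify.
def extract(x: Any): Array[Double] = x match {
  case features: Seq[_] => features.asInstanceOf[Seq[Double]].toArray
  case other            => sys.error(s"unexpected value: $other")
}
```

This mirrors the diff above: match the erased type, then cast, instead of matching on the parameterized type directly.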
[GitHub] spark pull request: SPARK-6225 [CORE] [SQL] [STREAMING] Resolve mo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4950#issuecomment-77856603 [Test build #28392 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28392/consoleFull) for PR 4950 at commit [`c67985b`](https://github.com/apache/spark/commit/c67985b01538a8e4ede806ce7e7b23af7a985a65). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-6195] [SQL] Adds in-memory column type ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4938#issuecomment-77860204 [Test build #28391 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28391/consoleFull) for PR 4938 at commit [`e08ab5b`](https://github.com/apache/spark/commit/e08ab5bc376cd67b79bc3eb195ec2a4302df2e37). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-6177][MLlib] LDA should check partition...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/4899#discussion_r26038573 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala --- @@ -174,6 +174,7 @@ object LDAExample { // Get dataset of document texts // One document per line in each text file. +// One partition each text file. Consider using coalesce as necessary. --- End diff -- As long as we're documenting this, let's edit this a bit more. It's not guaranteed that there will be a partition per text file. I'd say something more like: If the input consists of many small files, this can result in a large number of small partitions, which can degrade performance. In this case, consider using coalesce() to create fewer, larger partitions.
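The suggested wording above can be sketched in code (a sketch only: it assumes Spark on the classpath, and the path `data/docs` is made up). Reading many small files tends to produce one small partition per input split, and `coalesce()` merges them without a shuffle:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch under stated assumptions: a local Spark build and a hypothetical
// directory of many small text files.
val sc = new SparkContext(
  new SparkConf().setAppName("coalesce-sketch").setMaster("local[*]"))
val corpus = sc.textFile("data/docs/*.txt") // potentially thousands of tiny partitions
val fewer  = corpus.coalesce(16)            // fewer, larger partitions for downstream stages
```

Unlike `repartition()`, `coalesce()` with a smaller target count avoids a full shuffle, which is why it is the usual fix for the many-small-files case.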
[GitHub] spark pull request: [EC2] [SPARK-6188] Instance types can be misla...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4916
[GitHub] spark pull request: [SPARK-6199] [SQL] Support CTE in HiveContext ...
Github user haiyangsea commented on a diff in the pull request: https://github.com/apache/spark/pull/4929#discussion_r26027360 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -287,6 +282,20 @@ class SQLQuerySuite extends QueryTest with BeforeAndAfterAll { mapData.collect().take(1).map(Row.fromTuple).toSeq) } + test("CTE feature") { +checkAnswer( + sql("with q1 as (select * from testData limit 10) select * from q1"), + testData.take(10).toSeq) + +checkAnswer( + sql( +|with q1 as (select * from testData where key = '5'), --- End diff -- Updated, thank you for your review.
[GitHub] spark pull request: SPARK-6225 [CORE] [SQL] [STREAMING] Resolve mo...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/4950#discussion_r26036401 --- Diff: external/kafka/src/test/java/org/apache/spark/streaming/kafka/JavaKafkaRDDSuite.java --- @@ -19,23 +19,19 @@ import java.io.Serializable; import java.util.HashMap; -import java.util.HashSet; -import java.util.Arrays; - -import org.apache.spark.SparkConf; import scala.Tuple2; -import junit.framework.Assert; - import kafka.common.TopicAndPartition; import kafka.message.MessageAndMetadata; import kafka.serializer.StringDecoder; +import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.api.java.function.Function; +import org.junit.Assert; --- End diff -- I did a little extra cleanup in these test files that wasn't strictly speaking related to a warning, like importing the modern JUnit classes.
[GitHub] spark pull request: [SPARK-6224][SQL] Also collect NamedExpression...
Github user viirya closed the pull request at: https://github.com/apache/spark/pull/4949
[GitHub] spark pull request: [SPARK-6200] [SQL] Add a manager for dialects
Github user haiyangsea commented on a diff in the pull request: https://github.com/apache/spark/pull/4939#discussion_r26028022 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/commands.scala --- @@ -208,6 +209,87 @@ case class DescribeCommand( } /** + * :: DeveloperApi :: + */ +@DeveloperApi +case class ShowDialectsCommand( +isExtended: Boolean, +isCurrent: Boolean) extends RunnableCommand { --- End diff -- Updated, thank you for your review.
[GitHub] spark pull request: [SPARK-3830][MLlib] Implement genetic algorith...
Github user epahomov closed the pull request at: https://github.com/apache/spark/pull/2731
[GitHub] spark pull request: [SPARK-6087][CORE] Provide actionable exceptio...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4947#issuecomment-77836711 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28388/ Test PASSed.
[GitHub] spark pull request: [SPARK-3830][MLlib] Implement genetic algorith...
Github user epahomov commented on the pull request: https://github.com/apache/spark/pull/2731#issuecomment-77836797 My PR is too old for the current architecture, and I have already found too much to improve in it. I'll do better and resubmit.
[GitHub] spark pull request: [SPARK-6087][CORE] Provide actionable exceptio...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4947#issuecomment-77836705 [Test build #28388 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28388/consoleFull) for PR 4947 at commit [`48ab7f9`](https://github.com/apache/spark/commit/48ab7f984c75bcb8bfa9eec6330c67d9592b356e). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-6195] [SQL] Adds in-memory column type ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4938#issuecomment-77848860 [Test build #28391 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28391/consoleFull) for PR 4938 at commit [`e08ab5b`](https://github.com/apache/spark/commit/e08ab5bc376cd67b79bc3eb195ec2a4302df2e37). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-6224][SQL] Also collect NamedExpression...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/4949 [SPARK-6224][SQL] Also collect NamedExpressions in PhysicalOperation Currently in `PhysicalOperation`, only `Alias` expressions are collected. Similarly, `NamedExpression` can be collected for substitution. You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 collect_namedexpr Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4949.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4949 commit cee75657aa30239c094b2b7d7671815b4adac5eb Author: Liang-Chi Hsieh vii...@gmail.com Date: 2015-03-09T12:57:12Z Also collect NamedExpressions in PhysicalOperation.
[GitHub] spark pull request: [SPARK-6224][SQL] Also collect NamedExpression...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4949#issuecomment-77848861 [Test build #28390 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28390/consoleFull) for PR 4949 at commit [`cee7565`](https://github.com/apache/spark/commit/cee75657aa30239c094b2b7d7671815b4adac5eb). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-6224][SQL] Also collect NamedExpression...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4949#issuecomment-77853437 [Test build #28390 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28390/consoleFull) for PR 4949 at commit [`cee7565`](https://github.com/apache/spark/commit/cee75657aa30239c094b2b7d7671815b4adac5eb). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-6225 [CORE] [SQL] [STREAMING] Resolve mo...
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/4950 SPARK-6225 [CORE] [SQL] [STREAMING] Resolve most build warnings, 1.3.0 edition Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc. You can merge this pull request into a Git repository by running: $ git pull https://github.com/srowen/spark SPARK-6225 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4950.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4950 commit c67985b01538a8e4ede806ce7e7b23af7a985a65 Author: Sean Owen so...@cloudera.com Date: 2015-03-09T13:49:53Z Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc.
[GitHub] spark pull request: [SPARK-6199] [SQL] Support CTE in HiveContext ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4929#issuecomment-77834974 [Test build #28389 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28389/consoleFull) for PR 4929 at commit [`0d56af4`](https://github.com/apache/spark/commit/0d56af4b80f0dc775cffcf400d882d5888ca717f). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-6223][SQL] Fix build warning- enable im...
GitHub user vinodkc opened a pull request: https://github.com/apache/spark/pull/4948 [SPARK-6223][SQL] Fix build warning- enable implicit value scala.language.existentials visible You can merge this pull request into a Git repository by running: $ git pull https://github.com/vinodkc/spark add_scala.language.existentials Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4948.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4948 commit 919ca8cda851efd6a35daaa8d4fb12dc22fdc749 Author: Vinod K C vinod...@huawei.com Date: 2015-03-09T14:38:19Z Fix Build warning- enable implicit value scala.language.existentials visible
[GitHub] spark pull request: [SPARK-6095] [MLLIB] Support model save/load i...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/4911#issuecomment-77854374 @mengxr Yes, that makes sense. I will try to implement the save/load operations in Python to do the same thing as in Scala.
[GitHub] spark pull request: [SPARK-6199] [SQL] Support CTE in HiveContext ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4929#issuecomment-77844025 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28389/ Test PASSed.
[GitHub] spark pull request: [SPARK-6199] [SQL] Support CTE in HiveContext ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4929#issuecomment-77844015 [Test build #28389 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28389/consoleFull) for PR 4929 at commit [`0d56af4`](https://github.com/apache/spark/commit/0d56af4b80f0dc775cffcf400d882d5888ca717f). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class With(child: LogicalPlan, subQueries: Map[String, Subquery]) extends UnaryNode `
[GitHub] spark pull request: SPARK-6225 [CORE] [SQL] [STREAMING] Resolve mo...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/4950#discussion_r26036283 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -1104,7 +1104,7 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli if (!fs.exists(hadoopPath)) { throw new FileNotFoundException(s"Added file $hadoopPath does not exist.") } - val isDir = fs.isDirectory(hadoopPath) --- End diff -- In case you're wondering: no, this wasn't one of those things deprecated in Hadoop 2.x; this was deprecated in 1.0.4 even!
[GitHub] spark pull request: [SPARK-6223][SQL] Fix build warning- enable im...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/4948#discussion_r26036226 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala --- @@ -18,6 +18,7 @@ package org.apache.spark.sql.sources import scala.language.implicitConversions +import scala.language.existentials --- End diff -- I know it's trivial, but this is not ordered correctly, even. This was a change I included in a PR I was working on this weekend, and just submitted: https://github.com/apache/spark/pull/4948/files
[GitHub] spark pull request: [SPARK-5986][MLLib] Add save/load for k-means
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4951#issuecomment-77861051 [Test build #28393 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28393/consoleFull) for PR 4951 at commit [`dce7055`](https://github.com/apache/spark/commit/dce70553cb0e5c25d1bb0a415929eb5066af964a). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-6087][CORE] Provide actionable exceptio...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/4947#discussion_r26033358 --- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala --- @@ -158,7 +158,13 @@ private[spark] class KryoSerializerInstance(ks: KryoSerializer) extends Serializ override def serialize[T: ClassTag](t: T): ByteBuffer = { output.clear() -kryo.writeClassAndObject(output, t) +try { + kryo.writeClassAndObject(output, t) +} catch { + case e: KryoException if e.getMessage.startsWith("Buffer overflow") => +throw new SparkException("Serialization failed: Kryo buffer overflow. To avoid this, + --- End diff -- The original message in `KryoException` (available and requested sizes) is useful too. Would it be better to include it as well?
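The suggestion above (keeping Kryo's original diagnostics alongside the actionable hint) can be sketched in isolation. The `KryoException` stand-in and `serialize` wrapper below are hypothetical, not Spark's actual classes:

```scala
// Hypothetical stand-in for com.esotericsoftware.kryo.KryoException, used only
// so this sketch is self-contained.
class KryoException(msg: String) extends RuntimeException(msg)

// Sketch of the serialize path: rethrow with an actionable hint while
// preserving the serializer's own message, which reports the available and
// requested buffer sizes.
def serialize(write: () => Unit): Unit =
  try write() catch {
    case e: KryoException if e.getMessage.startsWith("Buffer overflow") =>
      throw new RuntimeException(
        s"Serialization failed: Kryo buffer overflow (${e.getMessage}). " +
          "To avoid this, increase spark.kryoserializer.buffer.max.", e)
  }
```

Passing the original exception as the cause also keeps the full stack trace available to anyone debugging the failure.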
[GitHub] spark pull request: [SPARK-6195] [SQL] Adds in-memory column type ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/4938#discussion_r26034326 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnAccessor.scala --- @@ -107,24 +110,28 @@ private[sql] class GenericColumnAccessor(buffer: ByteBuffer) with NullableColumnAccessor private[sql] object ColumnAccessor { - def apply(buffer: ByteBuffer): ColumnAccessor = { + def apply(dataType: DataType, buffer: ByteBuffer): ColumnAccessor = { val dup = buffer.duplicate().order(ByteOrder.nativeOrder) -// The first 4 bytes in the buffer indicate the column type. -val columnTypeId = dup.getInt() - -columnTypeId match { - case INT.typeId => new IntColumnAccessor(dup) - case LONG.typeId => new LongColumnAccessor(dup) - case FLOAT.typeId => new FloatColumnAccessor(dup) - case DOUBLE.typeId => new DoubleColumnAccessor(dup) - case BOOLEAN.typeId => new BooleanColumnAccessor(dup) - case BYTE.typeId => new ByteColumnAccessor(dup) - case SHORT.typeId => new ShortColumnAccessor(dup) - case STRING.typeId => new StringColumnAccessor(dup) - case DATE.typeId => new DateColumnAccessor(dup) - case TIMESTAMP.typeId => new TimestampColumnAccessor(dup) - case BINARY.typeId => new BinaryColumnAccessor(dup) - case GENERIC.typeId => new GenericColumnAccessor(dup) + +// The first 4 bytes in the buffer indicate the column type. This field is not used now, +// because we always know the data type of the column ahead of time. +dup.getInt() --- End diff -- This call has a side effect; we still need to call it to read the 4 bytes.
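The side effect being discussed is plain `java.nio` behavior, shown here in a minimal sketch (no Spark types involved): `getInt()` is not a pure read, it advances the buffer's position by 4 bytes, so even an unused type-ID field must be consumed before the payload can be read.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Write a type-ID field followed by a payload, then read the buffer back.
val buf = ByteBuffer.allocate(8).order(ByteOrder.nativeOrder)
buf.putInt(42) // type ID (ignored by the reader)
buf.putInt(7)  // actual payload
buf.flip()     // switch from writing to reading

val ignoredTypeId = buf.getInt() // side effect: position moves past the ID
val payload       = buf.getInt() // now positioned at the correct field
```

Dropping the `dup.getInt()` call in the diff above would leave the position at the type-ID bytes, and every subsequent read would be off by 4.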
[GitHub] spark pull request: [SPARK-6224][SQL] Also collect NamedExpression...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4949#issuecomment-77853448 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28390/ Test FAILed.
[GitHub] spark pull request: [SPARK-6195] [SQL] Adds in-memory column type ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4938#issuecomment-77860216 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28391/ Test PASSed.
[GitHub] spark pull request: [SPARK-6195] [SQL] Adds in-memory column type ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/4938#discussion_r26034375 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnAccessor.scala --- @@ -107,24 +110,28 @@ private[sql] class GenericColumnAccessor(buffer: ByteBuffer) with NullableColumnAccessor private[sql] object ColumnAccessor { - def apply(buffer: ByteBuffer): ColumnAccessor = { + def apply(dataType: DataType, buffer: ByteBuffer): ColumnAccessor = { val dup = buffer.duplicate().order(ByteOrder.nativeOrder) - // The first 4 bytes in the buffer indicate the column type. - val columnTypeId = dup.getInt() - columnTypeId match { - case INT.typeId => new IntColumnAccessor(dup) - case LONG.typeId => new LongColumnAccessor(dup) - case FLOAT.typeId => new FloatColumnAccessor(dup) - case DOUBLE.typeId => new DoubleColumnAccessor(dup) - case BOOLEAN.typeId => new BooleanColumnAccessor(dup) - case BYTE.typeId => new ByteColumnAccessor(dup) - case SHORT.typeId => new ShortColumnAccessor(dup) - case STRING.typeId => new StringColumnAccessor(dup) - case DATE.typeId => new DateColumnAccessor(dup) - case TIMESTAMP.typeId => new TimestampColumnAccessor(dup) - case BINARY.typeId => new BinaryColumnAccessor(dup) - case GENERIC.typeId => new GenericColumnAccessor(dup) + // The first 4 bytes in the buffer indicate the column type. This field is not used now, +// because we always know the data type of the column ahead of time. + dup.getInt() --- End diff -- However, we can remove this line once the whole column type ID mechanism is removed.
[GitHub] spark pull request: [SPARK-5986][MLLib] Add save/load for k-means
GitHub user yinxusen opened a pull request: https://github.com/apache/spark/pull/4951 [SPARK-5986][MLLib] Add save/load for k-means This PR adds save/load for K-means as described in SPARK-5986. Python version will be added in another PR. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yinxusen/spark SPARK-5986 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4951.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4951 commit dce70553cb0e5c25d1bb0a415929eb5066af964a Author: Xusen Yin yinxu...@gmail.com Date: 2015-03-09T14:12:59Z add save/load for k-means for SPARK-5986
[GitHub] spark pull request: [SPARK-6087][CORE] Provide actionable exceptio...
Github user levkhomich commented on a diff in the pull request: https://github.com/apache/spark/pull/4947#discussion_r26041766 --- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala --- @@ -158,7 +158,13 @@ private[spark] class KryoSerializerInstance(ks: KryoSerializer) extends Serializ override def serialize[T: ClassTag](t: T): ByteBuffer = { output.clear() - kryo.writeClassAndObject(output, t) + try { + kryo.writeClassAndObject(output, t) + } catch { + case e: KryoException if e.getMessage.startsWith("Buffer overflow") => + throw new SparkException("Serialization failed: Kryo buffer overflow. To avoid this, " + --- End diff -- Sure, you can check an example of the stack trace [here](http://pastebin.com/VSb2gisk).
[GitHub] spark pull request: [SPARK-6087][CORE] Provide actionable exceptio...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4947#issuecomment-77874158 [Test build #28397 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28397/consoleFull) for PR 4947 at commit [`0f7a947`](https://github.com/apache/spark/commit/0f7a947ac9de8ef66511b78822809aa414cf3ea7). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-6201] [SQL] promote string and do widen...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/4945#discussion_r26044965 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala --- @@ -220,6 +220,22 @@ trait HiveTypeCoercion { b.makeCopy(Array(newLeft, newRight)) }.getOrElse(b) // If there is no applicable conversion, leave expression unchanged. } + + // Also widen types for InExpressions. + case q: LogicalPlan => q transformExpressions { + // Skip nodes whose children have not been resolved yet. + case e if !e.childrenResolved => e + + case i @ In(a, b) if b.exists(_.dataType != a.dataType) => + b.map(_.dataType).foldLeft(None: Option[DataType])((r, c) => r match { + case None => Some(c) + case Some(dt) => findTightestCommonType(dt, c) + }) match { + // If there is no applicable conversion, leave expression unchanged. + case None => i.makeCopy(Array(a, b)) --- End diff -- Leave it as `i` instead of `i.makeCopy(..)`? Or throw an exception?
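The fold in the diff threads an `Option` through the list of element types to find a common widened type. A minimal sketch of the same pattern, with a toy stand-in for `findTightestCommonType` (the real Catalyst rules are richer):

```scala
// Toy stand-in: the real Catalyst findTightestCommonType handles many more cases.
def findTightestCommonType(a: String, b: String): Option[String] =
  (a, b) match {
    case (x, y) if x == y                      => Some(x)
    case ("int", "double") | ("double", "int") => Some("double")
    case _                                     => None
  }

// Fold over the types: the first element seeds the accumulator, each later
// element narrows it via findTightestCommonType; None means no common type.
def widen(types: Seq[String]): Option[String] =
  types.foldLeft(None: Option[String]) {
    case (None, t)     => Some(t)
    case (Some(dt), t) => findTightestCommonType(dt, t)
  }

println(widen(Seq("int", "double", "int"))) // Some(double)
println(widen(Seq("int", "string")))        // None: no applicable conversion
```

One subtlety of this pattern worth noting: if the accumulator ever becomes `None` mid-sequence, the next element re-seeds it, so a `None` result is only guaranteed when the failure happens on the last element.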
[GitHub] spark pull request: [SPARK-6087][CORE] Provide actionable exceptio...
Github user levkhomich commented on a diff in the pull request: https://github.com/apache/spark/pull/4947#discussion_r26038649 --- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala --- @@ -158,7 +158,13 @@ private[spark] class KryoSerializerInstance(ks: KryoSerializer) extends Serializ override def serialize[T: ClassTag](t: T): ByteBuffer = { output.clear() - kryo.writeClassAndObject(output, t) + try { + kryo.writeClassAndObject(output, t) + } catch { + case e: KryoException if e.getMessage.startsWith("Buffer overflow") => + throw new SparkException("Serialization failed: Kryo buffer overflow. To avoid this, " + --- End diff -- The original exception is preserved as `cause`, so it is printed anyway.
[GitHub] spark pull request: [SPARK-5986][MLLib] Add save/load for k-means
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4951#issuecomment-77861968 [Test build #28394 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28394/consoleFull) for PR 4951 at commit [`b144216`](https://github.com/apache/spark/commit/b144216f741776fdfe4c8e95d63650bd46c659d5). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-4044 [CORE] Thriftserver fails to start ...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4873#issuecomment-77863414 Sorry to bug @pwendell again but I think you may also be familiar with this script. I went to the extreme and removed the check for Hive jars entirely. Datanucleus goes on the classpath if it exists, full stop. This also resolves the JAR issue. But is there a reason that's a bad idea? Like, if I didn't build with Hive, but Datanucleus is lying around, does that cause a problem?
[GitHub] spark pull request: [SPARK-6198][SQL] Support select current_data...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/4926#discussion_r26041674 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala --- @@ -179,7 +179,12 @@ private[hive] case class HiveGenericUdf(funcWrapper: HiveFunctionWrapper, childr }) i += 1 } - unwrap(function.evaluate(deferedObjects), returnInspector) + + if (function.getUdfName().endsWith("UDFCurrentDB")) { --- End diff -- Can you explain why you think returning a `null` is more reasonable than executing the `UDFCurrentDB`? It seems it will not throw an exception anymore in Hive 0.14: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-exec/0.14.0/org/apache/hadoop/hive/ql/udf/generic/UDFCurrentDB.java/
[GitHub] spark pull request: [SPARK-6201] [SQL] promote string and do widen...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/4945#discussion_r26044586 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala --- @@ -269,6 +285,14 @@ trait HiveTypeCoercion { i.makeCopy(Array(Cast(a, StringType), b.map(Cast(_, StringType)))) case i @ In(a, b) if a.dataType == TimestampType && b.forall(_.dataType == DateType) => i.makeCopy(Array(Cast(a, StringType), b.map(Cast(_, StringType)))) + case i @ In(a, b) if a.dataType == StringType && + b.exists(_.dataType.isInstanceOf[NumericType]) => + i.makeCopy(Array(Cast(a, DoubleType), b)) + case i @ In(a, b) if b.exists(_.dataType == StringType) && + a.dataType.isInstanceOf[NumericType] => + i.makeCopy(Array(a, b.map(_.dataType match { + case StringType => Cast(a, DoubleType) --- End diff -- Wouldn't this cause a MatchError for non-string element types? ```scala case StringType => Cast(a, DoubleType) case x => x ```
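The reviewer's point is about match exhaustiveness: a `match` over an open-ended set of cases with no catch-all throws a `MatchError` at runtime. A minimal sketch with simplified stand-in types (not Catalyst code):

```scala
// Simplified stand-ins for illustration only.
sealed trait DataType
case object StringType extends DataType
case object IntType extends DataType

// Without the `case x => x` catch-all, coerce(IntType) would throw a MatchError.
def coerce(dt: DataType): String = dt match {
  case StringType => "cast to double" // the only case the original diff handled
  case x          => x.toString       // catch-all: leave everything else unchanged
}

println(coerce(StringType)) // cast to double
println(coerce(IntType))    // IntType
```

Deleting the second case and calling `coerce(IntType)` reproduces the failure the reviewer anticipates.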
[GitHub] spark pull request: [SPARK-6087][CORE] Provide actionable exceptio...
Github user levkhomich commented on a diff in the pull request: https://github.com/apache/spark/pull/4947#discussion_r26044200 --- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala --- @@ -158,7 +158,13 @@ private[spark] class KryoSerializerInstance(ks: KryoSerializer) extends Serializ override def serialize[T: ClassTag](t: T): ByteBuffer = { output.clear() - kryo.writeClassAndObject(output, t) + try { + kryo.writeClassAndObject(output, t) + } catch { + case e: KryoException if e.getMessage.startsWith("Buffer overflow") => + throw new SparkException("Serialization failed: Kryo buffer overflow. To avoid this, " + --- End diff -- @srowen @viirya I've squashed the corresponding change.
[GitHub] spark pull request: [SPARK-4734][Streaming]limit the file Dstream ...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3597#issuecomment-77874151 Mind closing this PR? I do not think this change is right for Spark.
[GitHub] spark pull request: [SPARK-5817] [SQL] Fix bug of udtf with column...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4602#issuecomment-77879123 [Test build #28395 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28395/consoleFull) for PR 4602 at commit [`7fa6e0d`](https://github.com/apache/spark/commit/7fa6e0d3e3cf83072e4dcf37fe24a89bdf0f8da1). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class Explode(child: Expression)`
[GitHub] spark pull request: SPARK-4044 [CORE] Thriftserver fails to start ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4873#issuecomment-77880845 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28396/ Test PASSed.
[GitHub] spark pull request: [Build] SPARK-2614: (2nd patch) Create a spark...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/1611#issuecomment-77875648 Mind closing this PR?
[GitHub] spark pull request: SPARK-6225 [CORE] [SQL] [STREAMING] Resolve mo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4950#issuecomment-77875679 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28392/ Test PASSed.
[GitHub] spark pull request: [SPARK-6051][Streaming] Add ZooKeeper offest p...
Github user koeninger commented on the pull request: https://github.com/apache/spark/pull/4805#issuecomment-77882744 As it stands now, no offsets are stored by Spark unless you're checkpointing. Does it really make sense to have an option to automatically store offsets in Kafka, but not store offsets in the checkpoint? Failure recovery in that case depends on user-provided starting offsets (or starting at the beginning / end of the log). If someone has the sophistication to get offsets from Kafka in order to provide them as a starting point, they probably have the sophistication to save offsets to Kafka themselves in the job. If offsets are only being sent to Kafka when they are also stored in the checkpoint, then does sending offsets to Kafka in compute() also make sense? Yes, you can lag behind, but those offsets are in the queue to get processed at least once. I'm not 100% sure on the answer to this; it's more a question of desired behavior, but that's why I brought it up. On Mon, Mar 9, 2015 at 12:14 AM, Saisai Shao notificati...@github.com wrote: Hi @koeninger https://github.com/koeninger , would you please review this again? Thanks a lot and appreciate your time. Here I still keep using the HashMap for the Time -> offset relation mapping; since checkpoint data will only be updated when checkpointing is enabled, I hope this could also work even without checkpointing enabled. And I still use StreamingListener to update the offset, for the reason mentioned before. Besides, I updated the configuration name; not sure whether it is suitable. Thanks a lot. https://github.com/apache/spark/pull/4805#issuecomment-77801344
[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4634#issuecomment-77862472 @pwendell @rxin I'd like to merge this, and while I'm all but sure the API change question is OK, I'd feel better if a maintainer could give it a look.
[GitHub] spark pull request: [SPARK-3188][MLLIB]: Add Robust Regression Alg...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/2110#issuecomment-77870526 I think this contribution may have timed out, along with https://github.com/apache/spark/pull/2096 . They're probably good implementations, but I am not clear if this will be taken forward to be part of Spark. In any event it doesn't merge and is not necessarily written for the new ML pipelines API. Does anyone else have an opinion on whether this should be closed out, or needs to be revived?
[GitHub] spark pull request: [SPARK-6051][Streaming] Add ZooKeeper offest p...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/4805#discussion_r26048829 --- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala --- @@ -84,6 +83,11 @@ class DirectKafkaInputDStream[ protected var currentOffsets = fromOffsets + // Map to manage the time -> topic/partition+offset mapping + private val offsetMap = new mutable.HashMap[Time, Map[TopicAndPartition, Long]]() + // Add to the listener bus for job completion hook + context.addStreamingListener(new DirectKafkaStreamingListener) + @tailrec protected final def latestLeaderOffsets(retries: Int): Map[TopicAndPartition, LeaderOffset] = { --- End diff -- Is there a reason to even add the streaming listener if the configuration option isn't turned on? If the config option isn't on, couldn't you skip the listener and skip adding/removing items to the offset map altogether?
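The gating the reviewer suggests can be sketched as follows (a hypothetical simplification with stand-in types, not the PR's actual code): allocate the map and register the listener only when the feature flag is on, so the disabled path pays nothing.

```scala
import scala.collection.mutable

// Stand-in types for illustration; the real code uses Spark's Time and
// kafka.common.TopicAndPartition.
case class Time(ms: Long)
type Offsets = Map[String, Long]

class OffsetTracker(commitEnabled: Boolean) {
  // Only allocated when the (hypothetical) commit option is enabled;
  // in the real stream, listener registration would be gated the same way.
  private val offsetMap: Option[mutable.HashMap[Time, Offsets]] =
    if (commitEnabled) Some(mutable.HashMap.empty) else None

  def record(t: Time, o: Offsets): Unit =
    offsetMap.foreach(_.update(t, o)) // no-op when the feature is off

  def recorded: Int = offsetMap.map(_.size).getOrElse(0)
}

val on = new OffsetTracker(commitEnabled = true)
on.record(Time(1000L), Map("topic-0" -> 42L))
println(on.recorded)  // 1
val off = new OffsetTracker(commitEnabled = false)
off.record(Time(1000L), Map("topic-0" -> 42L))
println(off.recorded) // 0
```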
[GitHub] spark pull request: [SPARK-6191] [EC2] Generalize ability to downl...
Github user nchammas commented on the pull request: https://github.com/apache/spark/pull/4919#issuecomment-77883455 Yeah, if @JoshRosen (who wrote the original `setup_boto()` function) can't take a look, maybe @shivaram can give this a look.
[GitHub] spark pull request: [SPARK-6191] [EC2] Generalize ability to downl...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4919#issuecomment-77869609 Obviously I'd like to get another actual active EC2 user to review this, but the principle looks fine. This is refactoring the boto-specific mechanism to be general, and at the moment it does not change behavior.
[GitHub] spark pull request: [SPARK-3181][MLLIB]: Add Robust Regression Alg...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/2096#issuecomment-77870482 I think this contribution may have timed out, along with https://github.com/apache/spark/pull/2110 . They're probably good implementations, but I am not clear if this will be taken forward to be part of Spark. In any event it doesn't merge and is not necessarily written for the new ML pipelines API. Does anyone else have an opinion on whether this should be closed out, or needs to be revived?
[GitHub] spark pull request: [SPARK-5986][MLLib] Add save/load for k-means
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4951#issuecomment-77878863 [Test build #28394 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28394/consoleFull) for PR 4951 at commit [`b144216`](https://github.com/apache/spark/commit/b144216f741776fdfe4c8e95d63650bd46c659d5). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class KMeansModel (val clusterCenters: Array[Vector]) extends Saveable with Serializable `
[GitHub] spark pull request: [SPARK-5986][MLLib] Add save/load for k-means
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4951#issuecomment-77878875 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28394/ Test PASSed.
[GitHub] spark pull request: SPARK-4044 [CORE] Thriftserver fails to start ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4873#issuecomment-77880824 [Test build #28396 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28396/consoleFull) for PR 4873 at commit [`18b53a0`](https://github.com/apache/spark/commit/18b53a01cdaf471580497c81629625173194b62d). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-4044 [CORE] Thriftserver fails to start ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4873#issuecomment-77863875 [Test build #28396 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28396/consoleFull) for PR 4873 at commit [`18b53a0`](https://github.com/apache/spark/commit/18b53a01cdaf471580497c81629625173194b62d). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-6223][SQL] Fix build warning- enable im...
Github user vinodkc closed the pull request at: https://github.com/apache/spark/pull/4948
[GitHub] spark pull request: [SPARK-6087][CORE] Provide actionable exceptio...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/4947#discussion_r26041428 --- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala --- @@ -158,7 +158,13 @@ private[spark] class KryoSerializerInstance(ks: KryoSerializer) extends Serializ override def serialize[T: ClassTag](t: T): ByteBuffer = { output.clear() - kryo.writeClassAndObject(output, t) + try { + kryo.writeClassAndObject(output, t) + } catch { + case e: KryoException if e.getMessage.startsWith("Buffer overflow") => + throw new SparkException("Serialization failed: Kryo buffer overflow. To avoid this, " + --- End diff -- The cause's stack trace / message would be printed by `printStackTrace`. It would not become part of the message from this new `SparkException`. Net-net I think it wouldn't hurt to just add additional info to the new `SparkException` message if it's deemed useful.
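The cause-chaining behavior both reviewers describe can be demonstrated with a small sketch (plain Scala with stand-in exception types, not Spark's classes): the wrapper's message is what callers see first, but the original exception rides along as the cause and still shows up in the full stack trace.

```scala
// A wrapper exception that preserves the original as its cause,
// mirroring how the PR wraps KryoException in SparkException.
class SparkishException(msg: String, cause: Throwable)
  extends RuntimeException(msg, cause)

val underlying = new RuntimeException("Buffer overflow. Available: 0, required: 4")

val wrapped =
  try throw underlying
  catch {
    case e: RuntimeException if e.getMessage.startsWith("Buffer overflow") =>
      new SparkishException(
        "Serialization failed: buffer overflow. Increase the buffer size setting.", e)
  }

// The actionable message is the wrapper's; the low-level detail is the cause's.
println(wrapped.getMessage)
assert(wrapped.getCause eq underlying) // printStackTrace would show "Caused by: ..."
```

This is why srowen notes the cause's message is not part of the new exception's message: `getMessage` on the wrapper returns only its own string, and the underlying detail appears separately in the "Caused by:" section of the trace.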
[GitHub] spark pull request: [SPARK-6087][CORE] Provide actionable exceptio...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4947#issuecomment-77874328 LGTM. I'll wait a bit longer for more comments.
[GitHub] spark pull request: SPARK-6225 [CORE] [SQL] [STREAMING] Resolve mo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4950#issuecomment-77875660 [Test build #28392 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28392/consoleFull) for PR 4950 at commit [`c67985b`](https://github.com/apache/spark/commit/c67985b01538a8e4ede806ce7e7b23af7a985a65). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5817] [SQL] Fix bug of udtf with column...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4602#issuecomment-77879139 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28395/ Test PASSed.
[GitHub] spark pull request: [SPARK-5817] [SQL] Fix bug of udtf with column...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4602#issuecomment-77862925 [Test build #28395 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28395/consoleFull) for PR 4602 at commit [`7fa6e0d`](https://github.com/apache/spark/commit/7fa6e0d3e3cf83072e4dcf37fe24a89bdf0f8da1). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-6198][SQL] Support select current_data...
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/4926#issuecomment-77866817 `SELECT 1` doesn't seem to work in Hive 0.12; support was probably introduced in Hive 0.13. See: https://issues.apache.org/jira/browse/HIVE-4144
[GitHub] spark pull request: [SPARK-6087][CORE] Provide actionable exceptio...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/4947#discussion_r26041118 --- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala --- @@ -158,7 +158,13 @@ private[spark] class KryoSerializerInstance(ks: KryoSerializer) extends Serializ override def serialize[T: ClassTag](t: T): ByteBuffer = { output.clear() -kryo.writeClassAndObject(output, t) +try { + kryo.writeClassAndObject(output, t) +} catch { + case e: KryoException if e.getMessage.startsWith("Buffer overflow") => +throw new SparkException("Serialization failed: Kryo buffer overflow. To avoid this, + --- End diff -- But as the Exception's Constructor Detail (http://docs.oracle.com/javase/7/docs/api/java/lang/Exception.html#Exception(java.lang.String,%20java.lang.Throwable) states, "Note that the detail message associated with cause is not automatically incorporated in this exception's detail message." Are we sure it will be printed?
[GitHub] spark pull request: [Build] SPARK-3624: Failed to find Spark assem...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/2477#issuecomment-77875745 Mind closing this PR?
[GitHub] spark pull request: [SPARK-5986][MLLib] Add save/load for k-means
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4951#issuecomment-77877303 [Test build #28393 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28393/consoleFull) for PR 4951 at commit [`dce7055`](https://github.com/apache/spark/commit/dce70553cb0e5c25d1bb0a415929eb5066af964a). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class KMeansModel (val clusterCenters: Array[Vector]) extends Saveable with Serializable `
[GitHub] spark pull request: [SPARK-5986][MLLib] Add save/load for k-means
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4951#issuecomment-77877321 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28393/ Test PASSed.
[GitHub] spark pull request: [SPARK-6051][Streaming] Add ZooKeeper offest p...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/4805#discussion_r26048624 --- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala --- @@ -118,6 +123,7 @@ class DirectKafkaInputDStream[ context.sparkContext, kafkaParams, currentOffsets, untilOffsets, messageHandler) currentOffsets = untilOffsets.map(kv => kv._1 -> kv._2.offset) +offsetMap += ((validTime, currentOffsets)) --- End diff -- Don't all mutations of the offsetMap need to be synchronized?
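For illustration, here is a minimal standalone sketch of the locking koeninger is asking about (the `OffsetTracker` type and its method names are hypothetical, not the PR's actual class): every read and write of the shared map goes through the same monitor, so concurrent batches cannot corrupt it.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the DStream's offset bookkeeping.
class OffsetTracker {
  private val offsetMap = mutable.HashMap.empty[Long, Map[Int, Long]]

  // All mutations acquire the tracker's monitor, as the review suggests.
  def record(validTime: Long, offsets: Map[Int, Long]): Unit = synchronized {
    offsetMap += ((validTime, offsets))
  }

  // Reads use the same lock so they never observe a half-applied update.
  def snapshot: Map[Long, Map[Int, Long]] = synchronized { offsetMap.toMap }
}
```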
[GitHub] spark pull request: [SPARK-6087][CORE] Provide actionable exceptio...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4947#issuecomment-77892155 [Test build #28397 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28397/consoleFull) for PR 4947 at commit [`0f7a947`](https://github.com/apache/spark/commit/0f7a947ac9de8ef66511b78822809aa414cf3ea7). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class BinaryClassificationMetrics(JavaModelWrapper):`
[GitHub] spark pull request: [SPARK-6025] [MLlib] Add helper method evaluat...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/4906#discussion_r26056478 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala --- @@ -69,6 +74,42 @@ class GradientBoostedTrees(private val boostingStrategy: BoostingStrategy) case _ => throw new IllegalArgumentException(s"$algo is not supported by the gradient boosting.") } +baseLearners = fitGradientBoostingModel.trees +baseLearnerWeights = fitGradientBoostingModel.treeWeights +fitGradientBoostingModel + } + + /** + * Method to compute error or loss for every iteration of gradient boosting. + * @param data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]] + * @param loss: evaluation metric that defaults to boostingStrategy.loss + * @return an array with index i having the losses or errors for the ensemble + * containing trees 1 to i + 1 + */ + def evaluateEachIteration( --- End diff -- This method should be implemented in the model, not in the estimator. There's no need to keep a duplicate of the model in the estimator class. (We try to keep estimator classes stateless except for parameter values so that they remain lightweight types.) This change will require a bit of refactoring, so I'll hold off on more comments until then.
[GitHub] spark pull request: [SPARK-3454] [WIP] separate json endpoints for...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4435#discussion_r26061359 --- Diff: core/src/main/scala/org/apache/spark/status/StatusJsonHandler.scala --- @@ -0,0 +1,168 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.status + +import javax.servlet.http.{HttpServletResponse, HttpServlet, HttpServletRequest} + +import com.fasterxml.jackson.annotation.JsonInclude +import com.fasterxml.jackson.databind.{SerializationFeature, ObjectMapper} +import org.apache.spark.status.api.ApplicationInfo +import org.apache.spark.ui.SparkUI +import org.apache.spark.ui.exec.ExecutorsJsonRoute +import org.apache.spark.ui.jobs.{AllJobsJsonRoute, OneStageJsonRoute, AllStagesJsonRoute} +import org.apache.spark.ui.storage.{AllRDDJsonRoute, RDDJsonRoute} +import org.eclipse.jetty.servlet.{ServletHolder, ServletContextHandler} + +import scala.util.matching.Regex + +import org.apache.spark.{Logging, SecurityManager} +import org.apache.spark.deploy.history.{OneApplicationJsonRoute, AllApplicationsJsonRoute} + + +/** + * Get the response for one endpoint in the json status api.
+ * + * Implementations only need to return the objects that are to be converted to json -- the framework + * will convert to json via jackson + */ +private[spark] trait StatusJsonRoute[T] { + def renderJson(request: HttpServletRequest): T +} + +private[spark] class JsonRequestHandler(uiRoot: UIRoot, securityManager: SecurityManager) extends Logging { + def route(req: HttpServletRequest): Option[StatusJsonRoute[_]] = { +specs.collectFirst { case (pattern, route) if pattern.pattern.matcher(req.getPathInfo()).matches() => + route +} + } + + private val noSlash = "[^/]" + + private val specs: IndexedSeq[(Regex, StatusJsonRoute[_])] = IndexedSeq( +"/applications/?".r -> new AllApplicationsJsonRoute(uiRoot), +s"/applications/$noSlash+/?".r -> new OneApplicationJsonRoute(uiRoot), +s"/applications/$noSlash+/jobs/?".r -> new AllJobsJsonRoute(this), +s"/applications/$noSlash+/executors/?".r -> new ExecutorsJsonRoute(this), +s"/applications/$noSlash+/stages/?".r -> new AllStagesJsonRoute(this), +s"/applications/$noSlash+/stages/$noSlash+/?".r -> new OneStageJsonRoute(this), +s"/applications/$noSlash+/storage/rdd/?".r -> new AllRDDJsonRoute(this), +s"/applications/$noSlash+/storage/rdd/$noSlash+/?".r -> new RDDJsonRoute(this) + ) + + private val jsonMapper = { +val t = new ObjectMapper() +t.registerModule(com.fasterxml.jackson.module.scala.DefaultScalaModule) +t.enable(SerializationFeature.INDENT_OUTPUT) +t.setSerializationInclusion(JsonInclude.Include.NON_NULL) +t + } + + val jsonContextHandler = { + +//TODO throw out all the JettyUtils stuff, so I can set the response status code, etc.
+val servlet = new HttpServlet { + override def doGet(request: HttpServletRequest, response: HttpServletResponse) { +if (securityManager.checkUIViewPermissions(request.getRemoteUser)) { + response.setContentType("text/json;charset=utf-8") + route(request) match { +case Some(jsonRoute) => + response.setHeader("Cache-Control", "no-cache, no-store, must-revalidate") + try { +val responseObj = jsonRoute.renderJson(request) +val result = jsonMapper.writeValueAsString(responseObj) +response.setStatus(HttpServletResponse.SC_OK) +response.getWriter.println(result) + } catch { +case iae: IllegalArgumentException = > + response.setStatus(HttpServletResponse.SC_BAD_REQUEST) + response.getOutputStream.print(iae.getMessage()) + } +case None => + println("no match for path: " +
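The routing scheme in the diff above, reduced to a self-contained sketch (handlers are replaced by plain strings for brevity, so the names here are illustrative): `collectFirst` returns the handler paired with the first regex whose pattern matches the whole request path.

```scala
import scala.util.matching.Regex

object RouteTable {
  // Matches a single path segment (no embedded slash).
  private val noSlash = "[^/]"

  // Ordered route table: first matching pattern wins.
  val specs: IndexedSeq[(Regex, String)] = IndexedSeq(
    "/applications/?".r -> "AllApplications",
    s"/applications/$noSlash+/?".r -> "OneApplication",
    s"/applications/$noSlash+/jobs/?".r -> "AllJobs"
  )

  // matches() anchors at both ends, so the whole path must match the pattern.
  def route(path: String): Option[String] =
    specs.collectFirst {
      case (pattern, handler) if pattern.pattern.matcher(path).matches() => handler
    }
}
```

Because `[^/]+` cannot cross a slash, `/applications/app-42/jobs` falls through the first two patterns and lands on the third, which is what makes the ordered table safe.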
[GitHub] spark pull request: [Build] SPARK-2614: (2nd patch) Create a spark...
Github user tzolov closed the pull request at: https://github.com/apache/spark/pull/1611
[GitHub] spark pull request: [SPARK-6201] [SQL] promote string and do widen...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/4945#discussion_r26054740 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala --- @@ -269,6 +285,14 @@ trait HiveTypeCoercion { i.makeCopy(Array(Cast(a, StringType), b.map(Cast(_, StringType)))) case i @ In(a, b) if a.dataType == TimestampType && b.forall(_.dataType == DateType) => i.makeCopy(Array(Cast(a, StringType), b.map(Cast(_, StringType)))) + case i @ In(a, b) if a.dataType == StringType && + b.exists(_.dataType.isInstanceOf[NumericType]) => +i.makeCopy(Array(Cast(a, DoubleType), b)) + case i @ In(a, b) if b.exists(_.dataType == StringType) && + a.dataType.isInstanceOf[NumericType] => +i.makeCopy(Array(a, b.map(_.dataType match { + case StringType => Cast(a, DoubleType) --- End diff -- Same as above.
[GitHub] spark pull request: [SPARK-6087][CORE] Provide actionable exceptio...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/4947#issuecomment-77892679 Is this not needed for `serializeStream` as well?
[GitHub] spark pull request: [SPARK-5682] Reuse hadoop encrypted shuffle al...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/4491#issuecomment-77904109 Hi @kellyzly , Renaming the PR sounds fine. But I see that the PR still has the old code. Are you planning on having the updated code up here soon? Otherwise, as @srowen suggests, we should close this, and you can open a new PR when you've addressed the issues with the current approach.
[GitHub] spark pull request: [Build] SPARK-3624: Failed to find Spark assem...
Github user tzolov commented on the pull request: https://github.com/apache/spark/pull/2477#issuecomment-77895022 I'm closing this PR as this functionality is deprecated.
[GitHub] spark pull request: [Build] SPARK-2614: (2nd patch) Create a spark...
Github user tzolov commented on the pull request: https://github.com/apache/spark/pull/1611#issuecomment-77895099 I'm closing this PR as this functionality is deprecated.
[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/4634#discussion_r26056981 --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala --- @@ -233,18 +235,44 @@ class JavaPairRDD[K, V](val rdd: RDD[(K, V)]) def combineByKey[C](createCombiner: JFunction[V, C], mergeValue: JFunction2[C, V, C], mergeCombiners: JFunction2[C, C, C], -partitioner: Partitioner): JavaPairRDD[K, C] = { +partitioner: Partitioner, +mapSideCombine: Boolean, +serializer: Serializer): JavaPairRDD[K, C] = { --- End diff -- looks ok. it would be better to add serializer to the doc if possible. also style wise, can you indent 4 spaces for the function parameters?
[GitHub] spark pull request: [SPARK-6186] [EC2] Make Tachyon version config...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/4901#discussion_r26060596 --- Diff: ec2/spark_ec2.py --- @@ -872,9 +886,13 @@ def deploy_files(conn, root_dir, opts, master_nodes, slave_nodes, modules): if "." in opts.spark_version: # Pre-built Spark deploy spark_v = get_validate_spark_version(opts.spark_version, opts.spark_git_repo) +tachyon_v = get_tachyon_version(spark_v) else: # Spark-only custom deploy spark_v = "%s|%s" % (opts.spark_git_repo, opts.spark_version) +tachyon_v = "" +print "Deploy spark via git hash, Tachyon won't be set up" --- End diff -- `Deploy spark` -> `Deploying Spark`
[GitHub] spark pull request: [SPARK-6186] [EC2] Make Tachyon version config...
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/4901#issuecomment-77911329 Thanks @uronce-cc - Change looks good to me but for the minor comment inline. @nchammas -- Any other comments ?
[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/4634#discussion_r26056992 --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala --- @@ -233,18 +235,44 @@ class JavaPairRDD[K, V](val rdd: RDD[(K, V)]) def combineByKey[C](createCombiner: JFunction[V, C], mergeValue: JFunction2[C, V, C], mergeCombiners: JFunction2[C, C, C], -partitioner: Partitioner): JavaPairRDD[K, C] = { +partitioner: Partitioner, +mapSideCombine: Boolean, +serializer: Serializer): JavaPairRDD[K, C] = { implicit val ctag: ClassTag[C] = fakeClassTag fromRDD(rdd.combineByKey( createCombiner, mergeValue, mergeCombiners, - partitioner + partitioner, + mapSideCombine, + serializer )) } /** - * Simplified version of combineByKey that hash-partitions the output RDD. + * Generic function to combine the elements for each key using a custom set of aggregation + * functions. Turns a JavaPairRDD[(K, V)] into a result of type JavaPairRDD[(K, C)], for a + * "combined type" C. Note that V and C can be different -- for example, one might group an + * RDD of type (Int, Int) into an RDD of type (Int, List[Int]). Users provide three + * functions: + * + * - `createCombiner`, which turns a V into a C (e.g., creates a one-element list) + * - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list) + * - `mergeCombiners`, to combine two C's into a single one. + * + * In addition, users can control the partitioning of the output RDD. This method automatically + * uses map-side aggregation in shuffling the RDD. + */ + def combineByKey[C](createCombiner: JFunction[V, C], +mergeValue: JFunction2[C, V, C], --- End diff -- 4 space indent here
[GitHub] spark pull request: [SPARK-2669] [yarn] Distribute client configur...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/4142#issuecomment-77899873 Ping.
[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/4634#issuecomment-77903115 Serializer seems ok to add. One thing I am not sure about is the mapSideCombine thing -- I'm never a fan of that parameter even though I added it myself, for the following reasons: 1. mapSideCombine is a MR term used in Hive that doesn't mean much outside of MR. A more proper name is partialAggregation. 2. The underlying implementation should be able to avoid partial aggregation if it finds that partial aggregation is expensive (i.e. after trying 1 records, check whether the hash table size is less than a specific threshold). It is one of the things we can easily auto tune.
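The partial-aggregation idea rxin describes can be sketched outside Spark (plain Scala collections, with hypothetical helper names, not Spark's implementation): a map-side combine folds values per key within a partition first, so the cross-partition merge only sees one combiner per key per partition instead of every raw record.

```scala
object PartialAgg {
  // Map-side combine for one "partition": fold each value into a
  // per-key combiner C using createCombiner / mergeValue.
  def combinePartition[K, V, C](records: Seq[(K, V)],
                                createCombiner: V => C,
                                mergeValue: (C, V) => C): Map[K, C] =
    records.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(mergeValue(_, v)).getOrElse(createCombiner(v)))
    }

  // Reduce-side step: merge the pre-aggregated maps from two partitions.
  def mergeCombiners[K, C](a: Map[K, C], b: Map[K, C], merge: (C, C) => C): Map[K, C] =
    b.foldLeft(a) { case (acc, (k, c)) =>
      acc.updated(k, acc.get(k).map(merge(_, c)).getOrElse(c))
    }
}
```

Skipping `combinePartition` and shipping raw records is exactly the `mapSideCombine = false` case; rxin's point is that the framework could decide this itself by sampling how much the per-partition map actually shrinks the data.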
[GitHub] spark pull request: [SPARK-3454] [WIP] separate json endpoints for...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4435#discussion_r26061003 --- Diff: core/src/main/scala/org/apache/spark/status/StatusJsonHandler.scala --- @@ -0,0 +1,168 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.status
+
+import javax.servlet.http.{HttpServletResponse, HttpServlet, HttpServletRequest}
+
+import com.fasterxml.jackson.annotation.JsonInclude
+import com.fasterxml.jackson.databind.{SerializationFeature, ObjectMapper}
+import org.apache.spark.status.api.ApplicationInfo
+import org.apache.spark.ui.SparkUI
+import org.apache.spark.ui.exec.ExecutorsJsonRoute
+import org.apache.spark.ui.jobs.{AllJobsJsonRoute, OneStageJsonRoute, AllStagesJsonRoute}
+import org.apache.spark.ui.storage.{AllRDDJsonRoute, RDDJsonRoute}
+import org.eclipse.jetty.servlet.{ServletHolder, ServletContextHandler}
+
+import scala.util.matching.Regex
+
+import org.apache.spark.{Logging, SecurityManager}
+import org.apache.spark.deploy.history.{OneApplicationJsonRoute, AllApplicationsJsonRoute}
+
+
+/**
+ * Get the response for one endpoint in the json status api.
+ *
+ * Implementations only need to return the objects that are to be converted to json -- the framework
+ * will convert to json via jackson.
+ */
+private[spark] trait StatusJsonRoute[T] {
+  def renderJson(request: HttpServletRequest): T
+}
+
+private[spark] class JsonRequestHandler(uiRoot: UIRoot, securityManager: SecurityManager) extends Logging {
+  def route(req: HttpServletRequest): Option[StatusJsonRoute[_]] = {
+    specs.collectFirst { case (pattern, route) if pattern.pattern.matcher(req.getPathInfo()).matches() =>
+      route
+    }
+  }
+
+  private val noSlash = "[^/]"
+
+  private val specs: IndexedSeq[(Regex, StatusJsonRoute[_])] = IndexedSeq(
+    "/applications/?".r -> new AllApplicationsJsonRoute(uiRoot),
+    s"/applications/$noSlash+/?".r -> new OneApplicationJsonRoute(uiRoot),
+    s"/applications/$noSlash+/jobs/?".r -> new AllJobsJsonRoute(this),
+    s"/applications/$noSlash+/executors/?".r -> new ExecutorsJsonRoute(this),
+    s"/applications/$noSlash+/stages/?".r -> new AllStagesJsonRoute(this),
+    s"/applications/$noSlash+/stages/$noSlash+/?".r -> new OneStageJsonRoute(this),
+    s"/applications/$noSlash+/storage/rdd/?".r -> new AllRDDJsonRoute(this),
+    s"/applications/$noSlash+/storage/rdd/$noSlash+/?".r -> new RDDJsonRoute(this)
+  )
+
+  private val jsonMapper = {
+    val t = new ObjectMapper()
+    t.registerModule(com.fasterxml.jackson.module.scala.DefaultScalaModule)
+    t.enable(SerializationFeature.INDENT_OUTPUT)
+    t.setSerializationInclusion(JsonInclude.Include.NON_NULL)
+    t
+  }
+
+  val jsonContextHandler = {
+
+    //TODO throw out all the JettyUtils stuff, so I can set the response status code, etc.
+    val servlet = new HttpServlet {
+      override def doGet(request: HttpServletRequest, response: HttpServletResponse) {
+        if (securityManager.checkUIViewPermissions(request.getRemoteUser)) {
+          response.setContentType("text/json;charset=utf-8")
+          route(request) match {
+            case Some(jsonRoute) =>
+              response.setHeader("Cache-Control", "no-cache, no-store, must-revalidate")
+              try {
+                val responseObj = jsonRoute.renderJson(request)
+                val result = jsonMapper.writeValueAsString(responseObj)
+                response.setStatus(HttpServletResponse.SC_OK)
+                response.getWriter.println(result)
+              } catch {
+                case iae: IllegalArgumentException =>
+                  response.setStatus(HttpServletResponse.SC_BAD_REQUEST)
+                  response.getOutputStream.print(iae.getMessage())
+              }
+            case None =>
+              println("no match for path: " +
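The routing logic in `JsonRequestHandler` above — an ordered table of (regex, handler) pairs scanned for the first full match — can be sketched in a self-contained way. This Python stand-in is illustrative only (the handler functions and return strings are hypothetical, not Spark's API); it shows why the patterns must fully match the path and why table order matters:

```python
# Minimal sketch of regex-table request dispatch, modeled on the specs table
# in JsonRequestHandler. Handlers here are plain functions returning strings.
import re

NO_SLASH = "[^/]"  # matches one non-slash character, like the PR's `noSlash`

SPECS = [
    (re.compile(r"/applications/?"), lambda path: "all applications"),
    (re.compile(rf"/applications/{NO_SLASH}+/?"), lambda path: f"one application: {path}"),
    (re.compile(rf"/applications/{NO_SLASH}+/jobs/?"), lambda path: f"jobs for: {path}"),
]

def route(path):
    """Return the first handler whose pattern fully matches, like collectFirst."""
    for pattern, handler in SPECS:
        # Scala's pattern.matcher(path).matches() is a *full* match,
        # so the Python equivalent is fullmatch, not search.
        if pattern.fullmatch(path):
            return handler(path)
    return None  # corresponds to the servlet's 404 branch
```

Because `"/applications/?"` only fully matches the bare collection path, a request like `/applications/app-1` falls through to the next, more specific pattern — the same reason the Scala table lists patterns from least to most specific path depth.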
[GitHub] spark pull request: [SPARK-3454] [WIP] separate json endpoints for...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4435#discussion_r26061127 --- Diff: core/src/main/scala/org/apache/spark/status/api/ApplicationInfo.scala --- @@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.status.api
+
+case class ApplicationInfo(
--- End diff -- I agree; I think a single file would be clearer. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [Build] SPARK-3624: Failed to find Spark assem...
Github user tzolov closed the pull request at: https://github.com/apache/spark/pull/2477
[GitHub] spark pull request: [SPARK-6201] [SQL] promote string and do widen...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/4945#discussion_r26054354 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala --- @@ -269,6 +285,14 @@ trait HiveTypeCoercion {
       i.makeCopy(Array(Cast(a, StringType), b.map(Cast(_, StringType))))
     case i @ In(a, b) if a.dataType == TimestampType && b.forall(_.dataType == DateType) =>
       i.makeCopy(Array(Cast(a, StringType), b.map(Cast(_, StringType))))
+    case i @ In(a, b) if a.dataType == StringType &&
+        b.exists(_.dataType.isInstanceOf[NumericType]) =>
+      i.makeCopy(Array(Cast(a, DoubleType), b))
--- End diff -- As I've commented on the JIRA ticket, this is not the behavior of Hive. Hive actually converts the numerics in the constant set into strings.
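The direction of the coercion changes which values match. A small illustrative Python stand-in (the function names are hypothetical, not Hive or Catalyst APIs) contrasts the PR's approach — cast the string side to double — with the behavior the reviewer describes, where the numeric constants are cast to strings:

```python
# `value` plays the role of the string column; `constants` the numeric IN-list.

def in_cast_string_to_double(value, constants):
    """PR's rule: cast the string side to double, compare numerically."""
    try:
        return float(value) in {float(c) for c in constants}
    except ValueError:
        return False  # a non-numeric string can never match under this rule

def in_cast_numerics_to_string(value, constants):
    """Reviewer's description of Hive: cast the numerics to strings instead."""
    return value in {str(c) for c in constants}
```

For `'1.0' IN (1, 2)` the first rule matches (1.0 == 1), while the second does not (`"1.0" != "1"`) — exactly the kind of semantic divergence the review comment is flagging.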
[GitHub] spark pull request: [SPARK-6025] [MLlib] Add helper method evaluat...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/4906#discussion_r26056473 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala --- @@ -69,6 +74,42 @@ class GradientBoostedTrees(private val boostingStrategy: BoostingStrategy)
       case _ => throw new IllegalArgumentException(s"$algo is not supported by the gradient boosting.")
     }
+    baseLearners = fitGradientBoostingModel.trees
+    baseLearnerWeights = fitGradientBoostingModel.treeWeights
+    fitGradientBoostingModel
+  }
+
+  /**
+   * Method to compute error or loss for every iteration of gradient boosting.
+   * @param data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
+   * @param loss: evaluation metric that defaults to boostingStrategy.loss
+   * @return an array with index i having the losses or errors for the ensemble
+   *         containing trees 1 to i + 1
+   */
+  def evaluateEachIteration(
+      data: RDD[LabeledPoint],
+      loss: Loss = boostingStrategy.loss): Array[Double] = {
+
+    val algo = boostingStrategy.treeStrategy.algo
+    val remappedData = algo match {
+      case Classification => data.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
+      case _ => data
+    }
+    val initialTree = baseLearners(0)
+    val evaluationArray = Array.fill(numIterations)(0.0)
+
+    // Initial weight is 1.0
+    var predictionRDD = remappedData.map(i => initialTree.predict(i.features))
+    evaluationArray(0) = loss.computeError(remappedData, predictionRDD)
+
+    (1 until numIterations).map { nTree =>
--- End diff -- This does numIterations maps, broadcasting the model numIterations times. I'd recommend using a broadcast variable for the model to make sure it's only sent once. You could keep the current approach pretty much as-is, but it does numIterations actions, so it's a bit inefficient. You could optimize it by using only 1 map, but that would require modifying the computeError method as follows: * computeError could be overloaded to take (prediction: Double, datum: LabeledPoint). This could replace the computeError method you implemented. * Here, in evaluateEachIteration, you could call predictionRDD.map, and within the map, for each data point, you could evaluate each tree on the data point, compute the prediction from each iteration via a cumulative sum, and then call computeError on each prediction.
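The reviewer's single-pass suggestion can be sketched as follows. This is a hedged Python stand-in, not the Spark code: plain lists replace RDDs, trees are plain predict functions, and the per-point `loss(prediction, label)` signature models the proposed `computeError` overload — all of these names are illustrative assumptions:

```python
# For each data point, accumulate the weighted prediction tree by tree and
# record the loss of every ensemble prefix, so the whole evaluation is one
# pass over the data instead of numIterations passes.
def evaluate_each_iteration(data, trees, tree_weights, loss):
    """data: list of (label, features); trees: list of predict functions.
    Returns errors[i] = mean loss of the ensemble using trees 0..i."""
    n = len(data)
    num_trees = len(trees)
    sums = [0.0] * num_trees  # running loss total per prefix length
    for label, features in data:
        prediction = 0.0
        for i, (tree, weight) in enumerate(zip(trees, tree_weights)):
            prediction += weight * tree(features)  # cumulative-sum prediction
            sums[i] += loss(prediction, label)     # per-point computeError
    return [s / n for s in sums]
```

In the Spark version, the outer loop over points would be the single `map` over the (broadcast) model, and the per-prefix sums would come back through an aggregate rather than a mutable array.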
[GitHub] spark pull request: [SPARK-6191] [EC2] Generalize ability to downl...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/4919#issuecomment-77905979 This seems fine to me. I guess the alternatives would be:
1. storing the libraries in our source tree, which is a bad option for several reasons, including licensing, file size, upgradability, etc.
2. requiring the users to install the libraries themselves using a `pip` requirements file, but that adds another dependency on pip.
I think that this is fine for now. As part of our binary release packaging scripts, we could download and include these archives so that only users who build from source will need to perform these downloads.
[GitHub] spark pull request: [SPARK-3454] [WIP] separate json endpoints for...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4435#discussion_r26060515 --- Diff: core/src/test/scala/org/apache/spark/status/JsonRequestHandlerTest.scala --- @@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.status
+
+import org.apache.spark.JobExecutionStatus
+import org.apache.spark.status.api.StageStatus
+import org.scalatest.{Matchers, FunSuite}
+
+class JsonRequestHandlerTest extends FunSuite with Matchers {
--- End diff -- This should be named `JsonRequestHandlerSuite` to be consistent with the `*Suite` naming convention that we use for our tests.
[GitHub] spark pull request: [SPARK-6087][CORE] Provide actionable exceptio...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4947#issuecomment-77892167 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28397/ Test PASSed.
[GitHub] spark pull request: [SPARK-6228] [network] Move SASL classes from ...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4953#issuecomment-77946908 It's strictly a code move, from child to parent module. Although I've never been that familiar with this code, I understand the motivation, to use it from the other child module, which seems sound. I'll let it stay open for a day or two in case there are other thoughts. If not I think this can merge.
[GitHub] spark pull request: SPARK-6225 [CORE] [SQL] [STREAMING] Resolve mo...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/4950#discussion_r26078523 --- Diff: external/kafka/src/test/java/org/apache/spark/streaming/kafka/JavaKafkaRDDSuite.java --- @@ -19,23 +19,19 @@
 import java.io.Serializable;
 import java.util.HashMap;
-import java.util.HashSet;
-import java.util.Arrays;
-
-import org.apache.spark.SparkConf;
 import scala.Tuple2;
-import junit.framework.Assert;
-
 import kafka.common.TopicAndPartition;
 import kafka.message.MessageAndMetadata;
 import kafka.serializer.StringDecoder;
+import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.api.java.function.Function;
+import org.junit.Assert;
--- End diff -- Organize imports as long as you're at it?
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r26079948 --- Diff: bin/spark-sql --- @@ -43,15 +46,12 @@ function usage {
   echo
   echo "CLI options:"
   $FWDIR/bin/spark-class $CLASS --help 2>&1 | grep -v $pattern 1>&2
+  exit $2
--- End diff --
```
exit "$2"
```
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r26079926 --- Diff: bin/spark-sql --- @@ -25,12 +25,15 @@
 set -o posix
 
 # NOTE: This exact class name is matched downstream by SparkSubmit.
 # Any changes need to be reflected there.
-CLASS=org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
+export CLASS=org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
 
 # Figure out where Spark is installed
-FWDIR=$(cd `dirname $0`/..; pwd)
+export FWDIR=$(cd `dirname $0`/..; pwd)
 
 function usage {
+  if [ -n $1 ]; then
+    echo $1
--- End diff --
```
echo "$1"
```
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r26079987 --- Diff: bin/spark-submit --- @@ -17,58 +17,18 @@
 # limitations under the License.
 #
 
-# NOTE: Any changes in this file must be reflected in SparkSubmitDriverBootstrapper.scala!
-
-export SPARK_HOME=$(cd `dirname $0`/..; pwd)
-ORIG_ARGS=("$@")
-
-# Set COLUMNS for progress bar
-export COLUMNS=`tput cols`
-
-while (($#)); do
-  if [ "$1" = "--deploy-mode" ]; then
-    SPARK_SUBMIT_DEPLOY_MODE=$2
-  elif [ "$1" = "--properties-file" ]; then
-    SPARK_SUBMIT_PROPERTIES_FILE=$2
-  elif [ "$1" = "--driver-memory" ]; then
-    export SPARK_SUBMIT_DRIVER_MEMORY=$2
-  elif [ "$1" = "--driver-library-path" ]; then
-    export SPARK_SUBMIT_LIBRARY_PATH=$2
-  elif [ "$1" = "--driver-class-path" ]; then
-    export SPARK_SUBMIT_CLASSPATH=$2
-  elif [ "$1" = "--driver-java-options" ]; then
-    export SPARK_SUBMIT_OPTS=$2
-  elif [ "$1" = "--master" ]; then
-    export MASTER=$2
-  fi
-  shift
-done
-
-if [ -z "$SPARK_CONF_DIR" ]; then
-  export SPARK_CONF_DIR="$SPARK_HOME/conf"
-fi
-DEFAULT_PROPERTIES_FILE="$SPARK_CONF_DIR/spark-defaults.conf"
-if [ "$MASTER" == "yarn-cluster" ]; then
-  SPARK_SUBMIT_DEPLOY_MODE=cluster
+SPARK_HOME=$(cd `dirname $0`/..; pwd)
+
+# Only define a usage function if an upstream script hasn't done so.
+if ! type -t usage >/dev/null 2>&1; then
+  usage() {
+    if [ -n $1 ]; then
+      echo $1
+    fi
+    $SPARK_HOME/bin/spark-class org.apache.spark.deploy.SparkSubmit --help
+    exit $2
--- End diff --
```
exit "$2"
```
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/3916#discussion_r26079978 --- Diff: bin/spark-submit --- @@ -17,58 +17,18 @@
 # limitations under the License.
 #
 
-# NOTE: Any changes in this file must be reflected in SparkSubmitDriverBootstrapper.scala!
-
-export SPARK_HOME=$(cd `dirname $0`/..; pwd)
-ORIG_ARGS=("$@")
-
-# Set COLUMNS for progress bar
-export COLUMNS=`tput cols`
-
-while (($#)); do
-  if [ "$1" = "--deploy-mode" ]; then
-    SPARK_SUBMIT_DEPLOY_MODE=$2
-  elif [ "$1" = "--properties-file" ]; then
-    SPARK_SUBMIT_PROPERTIES_FILE=$2
-  elif [ "$1" = "--driver-memory" ]; then
-    export SPARK_SUBMIT_DRIVER_MEMORY=$2
-  elif [ "$1" = "--driver-library-path" ]; then
-    export SPARK_SUBMIT_LIBRARY_PATH=$2
-  elif [ "$1" = "--driver-class-path" ]; then
-    export SPARK_SUBMIT_CLASSPATH=$2
-  elif [ "$1" = "--driver-java-options" ]; then
-    export SPARK_SUBMIT_OPTS=$2
-  elif [ "$1" = "--master" ]; then
-    export MASTER=$2
-  fi
-  shift
-done
-
-if [ -z "$SPARK_CONF_DIR" ]; then
-  export SPARK_CONF_DIR="$SPARK_HOME/conf"
-fi
-DEFAULT_PROPERTIES_FILE="$SPARK_CONF_DIR/spark-defaults.conf"
-if [ "$MASTER" == "yarn-cluster" ]; then
-  SPARK_SUBMIT_DEPLOY_MODE=cluster
+SPARK_HOME=$(cd `dirname $0`/..; pwd)
+
+# Only define a usage function if an upstream script hasn't done so.
+if ! type -t usage >/dev/null 2>&1; then
+  usage() {
+    if [ -n $1 ]; then
+      echo $1
--- End diff --
```
echo "$1"
```