[GitHub] spark pull request: [SPARK-5068][SQL]fix bug query data when path ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3891#issuecomment-68833142 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25090/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Update HiveMetastoreCatalog.scala(override the...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3898#issuecomment-68833371 Finally, please file a JIRA and add it to the PR title. Thanks!
[GitHub] spark pull request: [SPARK-4987] [SQL] parquet timestamp type supp...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3820#issuecomment-68834477 Thanks for doing this, I've been getting a ton of requests for this feature! Can you also add this to the sql programming guide?
[GitHub] spark pull request: SPARK-4843 [YARN] Squash ExecutorRunnableUtil ...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/3696#issuecomment-68834508 LGTM, so I'm going to merge this into `master` (1.3.0). Thanks!
[GitHub] spark pull request: SPARK-4843 [YARN] Squash ExecutorRunnableUtil ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3696
[GitHub] spark pull request: [SPARK-5068][SQL]fix bug query data when path ...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3891#issuecomment-68832961 ok to test
[GitHub] spark pull request: [SPARK-5068][SQL]fix bug query data when path ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3891#issuecomment-68833139 [Test build #25090 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25090/consoleFull) for PR 3891 at commit [`55636f3`](https://github.com/apache/spark/commit/55636f3764e93e997c15d2de46e7118af34989b2). * This patch **fails RAT tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5068][SQL]fix bug query data when path ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3891#issuecomment-68833133 [Test build #25090 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25090/consoleFull) for PR 3891 at commit [`55636f3`](https://github.com/apache/spark/commit/55636f3764e93e997c15d2de46e7118af34989b2). * This patch merges cleanly.
[GitHub] spark pull request: Update HiveMetastoreCatalog.scala(override the...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3898#issuecomment-68833152 Can you also add a regression test to [`CachedTableSuite`](https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/CachedTableSuite.scala)?
[GitHub] spark pull request: Update HiveMetastoreCatalog.scala(override the...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3898#issuecomment-68833170 ok to test
[GitHub] spark pull request: Update HiveMetastoreCatalog.scala(override the...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3898#discussion_r22511002 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -427,6 +427,13 @@ private[hive] case class MetastoreRelation .getOrElse(sqlContext.defaultSizeInBytes)) } ) + override def sameResult(plan: LogicalPlan): Boolean = { + val new_plan = plan.asInstanceOf[MetastoreRelation]; --- End diff -- What about when it's not a MetastoreRelation? How about:

```scala
plan match {
  case mr: MetastoreRelation =>
    mr.databaseName == databaseName && mr.tableName == tableName
  case _ => false
}
```
[GitHub] spark pull request: Update HiveMetastoreCatalog.scala(override the...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3898#issuecomment-68833443 [Test build #25091 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25091/consoleFull) for PR 3898 at commit [`8d910aa`](https://github.com/apache/spark/commit/8d910aa45372c0cf0a3b2a4704b0e1a480bbeb1c). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-4226: Add support for subqueries in wher...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3888#issuecomment-68833871 This is simpler, but it has several disadvantages compared to the other approach:
- The InSet is collected to the driver and thus could cause OOMs when large
- I don't think it handles correlated subqueries
- The `execute()` involves eager evaluation and breaks RDD lineage
For these reasons I think we should stick to extending the approach taken by the other PR.
[GitHub] spark pull request: [SPARK-4987] [SQL] parquet timestamp type supp...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3820#discussion_r22511350 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala --- @@ -141,6 +142,12 @@ private[sql] trait SQLConf { getConf(PARQUET_BINARY_AS_STRING, false).toBoolean /** + * When set to true, we always treat INT96Values in Parquet files as timestamp. + */ + private[spark] def isParquetINT96AsTimestamp: Boolean = +getConf(PARQUET_INT96_AS_TIMESTAMP, false).toBoolean --- End diff -- We don't really use INT96 for anything else (and I don't think other systems do either?) so maybe this should be true by default?
[GitHub] spark pull request: [SPARK-2165][YARN]add support for setting maxA...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3878#issuecomment-68681027 My opinion is that this is more of a general app property than an AM property, so I'd go for `spark.yarn.maxAppAttempts` instead. That also avoids confusion with the fact that elsewhere we've claimed the `spark.yarn.am.*` namespace for the yarn-client AM.
[GitHub] spark pull request: [Minor][Mllib] Simplify loss function
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/3899 [Minor][Mllib] Simplify loss function This is a minor PR where I think we can simply take the minus of `margin` here, instead of subtracting `margin`. Mathematically they are equal, but the modified equation is the common form of the logistic loss function and so is more readable. It also computes more accurate values, as some quick tests show. You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 logit_func Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3899.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3899 commit 2bc5712a6fe374c0821233edab719b58d31ab01b Author: Liang-Chi Hsieh vii...@gmail.com Date: 2015-01-05T09:13:07Z Simplify loss function.
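The algebra behind this simplification can be checked directly: log(1 + e^margin) - margin = log1p(e^(-margin)), and the right-hand form also behaves better numerically for large margins. A quick standalone sketch (illustrative only, not MLlib's actual code; the function names here are mine):

```scala
// Two algebraically equal forms of the logistic loss term.
// oldForm subtracts the margin after the log; newForm takes the minus of
// the margin inside, which is the common form of the logistic loss.
def oldForm(margin: Double): Double = math.log1p(math.exp(margin)) - margin
def newForm(margin: Double): Double = math.log1p(math.exp(-margin))

// They agree for moderate margins...
assert(math.abs(oldForm(2.0) - newForm(2.0)) < 1e-12)

// ...but at margin = 100 the subtraction in oldForm cancels away the
// tiny true value (about 3.7e-44), leaving only rounding noise,
// while newForm preserves it.
assert(math.abs(oldForm(100.0)) < 1e-13)
assert(newForm(100.0) > 0.0 && newForm(100.0) < 1e-40)
```

The equality follows from log(1 + e^m) - m = log((1 + e^m) / e^m) = log(e^(-m) + 1), which is why the two forms can only differ by floating-point error.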
[GitHub] spark pull request: [Minor][Mllib] Simplify loss function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3899#issuecomment-68686539 [Test build #25053 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25053/consoleFull) for PR 3899 at commit [`2bc5712`](https://github.com/apache/spark/commit/2bc5712a6fe374c0821233edab719b58d31ab01b). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4631] unit test for MQTT
Github user Bilna commented on the pull request: https://github.com/apache/spark/pull/3844#issuecomment-68686800 @tdas, Thanks
[GitHub] spark pull request: [Minor][Mllib] Simplify loss function
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3899#issuecomment-68687443 +1 looks like a small good improvement.
[GitHub] spark pull request: [SPARK-4631] unit test for MQTT
Github user Bilna commented on a diff in the pull request: https://github.com/apache/spark/pull/3844#discussion_r22453522

--- Diff: external/mqtt/src/test/scala/org/apache/spark/streaming/mqtt/MQTTStreamSuite.scala ---
@@ -17,31 +17,111 @@
 package org.apache.spark.streaming.mqtt

-import org.scalatest.FunSuite
+import java.net.{URI, ServerSocket}

-import org.apache.spark.streaming.{Seconds, StreamingContext}
+import org.apache.activemq.broker.{TransportConnector, BrokerService}
+import org.apache.spark.util.Utils
+import org.scalatest.{BeforeAndAfter, FunSuite}
+import org.scalatest.concurrent.Eventually
+import scala.concurrent.duration._
+import org.apache.spark.streaming.{Milliseconds, StreamingContext}
 import org.apache.spark.storage.StorageLevel
 import org.apache.spark.streaming.dstream.ReceiverInputDStream
+import org.eclipse.paho.client.mqttv3._
+import org.eclipse.paho.client.mqttv3.persist.MqttDefaultFilePersistence

-class MQTTStreamSuite extends FunSuite {
-
-  val batchDuration = Seconds(1)
+class MQTTStreamSuite extends FunSuite with Eventually with BeforeAndAfter {
+  private val batchDuration = Milliseconds(500)
   private val master: String = "local[2]"
   private val framework: String = this.getClass.getSimpleName
+  private val freePort = findFreePort()
+  private val brokerUri = "//localhost:" + freePort
+  private val topic = "def"
+  private var ssc: StreamingContext = _
+  private val persistenceDir = Utils.createTempDir()
+  private var broker: BrokerService = _
+  private var connector: TransportConnector = _

+  before {
+    ssc = new StreamingContext(master, framework, batchDuration)
+    setupMQTT()
+  }

-  test("mqtt input stream") {
-    val ssc = new StreamingContext(master, framework, batchDuration)
-    val brokerUrl = "abc"
-    val topic = "def"
-
-    // tests the API, does not actually test data receiving
-    val test1: ReceiverInputDStream[String] = MQTTUtils.createStream(ssc, brokerUrl, topic)
-    val test2: ReceiverInputDStream[String] =
-      MQTTUtils.createStream(ssc, brokerUrl, topic, StorageLevel.MEMORY_AND_DISK_SER_2)

+  after {
+    if (ssc != null) {
+      ssc.stop()
+      ssc = null
+    }
+    Utils.deleteRecursively(persistenceDir)
+    tearDownMQTT()
+  }

-    // TODO: Actually test receiving data
+  test("mqtt input stream") {
+    val sendMessage = "MQTT demo for spark streaming"
+    val receiveStream: ReceiverInputDStream[String] =
+      MQTTUtils.createStream(ssc, "tcp:" + brokerUri, topic, StorageLevel.MEMORY_ONLY)
+    var receiveMessage: List[String] = List()
+    receiveStream.foreachRDD { rdd =>
+      if (rdd.collect.length > 0) {
+        receiveMessage = receiveMessage ::: List(rdd.first)
+        receiveMessage
+      }
+    }
+    ssc.start()
+    publishData(sendMessage)
+    eventually(timeout(1 milliseconds), interval(100 milliseconds)) {
+      assert(sendMessage.equals(receiveMessage(0)))
+    }
     ssc.stop()
   }
+
+  private def setupMQTT() {
+    broker = new BrokerService()
+    connector = new TransportConnector()
+    connector.setName("mqtt")
+    connector.setUri(new URI("mqtt:" + brokerUri))
+    broker.addConnector(connector)
+    broker.start()
+  }
+
+  private def tearDownMQTT() {
+    if (broker != null) {
+      broker.stop()
+      broker = null
+    }
+    if (connector != null) {
+      connector.stop()
+      connector = null
+    }
+  }
+
+  private def findFreePort(): Int = {
+    Utils.startServiceOnPort(23456, (trialPort: Int) => {
+      val socket = new ServerSocket(trialPort)
+      socket.close()
+      (null, trialPort)
+    })._2
+  }
+
+  def publishData(data: String): Unit = {
+    var client: MqttClient = null
+    try {
+      val persistence: MqttClientPersistence = new MqttDefaultFilePersistence(persistenceDir.getAbsolutePath)
+      client = new MqttClient("tcp:" + brokerUri, MqttClient.generateClientId(), persistence)
+      client.connect()
+      if (client.isConnected) {
+        val msgTopic: MqttTopic = client.getTopic(topic)
+        val message: MqttMessage = new MqttMessage(data.getBytes("utf-8"))
+        message.setQos(1)
+        message.setRetained(true)
+        for (i <- 0 to 100)
+          msgTopic.publish(message)
--- End diff --

Can you explain what the correction is here? Just to understand what went wrong.
[GitHub] spark pull request: [SPARK-1600] Refactor FileInputStream tests to...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3801#issuecomment-68688402 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25052/ Test FAILed.
[GitHub] spark pull request: [SPARK-1600] Refactor FileInputStream tests to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3801#issuecomment-68688395 [Test build #25052 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25052/consoleFull) for PR 3801 at commit [`b4442c3`](https://github.com/apache/spark/commit/b4442c3538ad462a0a7d39f4b2049ed230e92665). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [Minor][Mllib] Simplify loss function
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3899#issuecomment-68693770 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25053/ Test PASSed.
[GitHub] spark pull request: [SPARK-2165][YARN]add support for setting maxA...
Github user WangTaoTheTonic commented on the pull request: https://github.com/apache/spark/pull/3878#issuecomment-68693775 @sryza Thanks. That makes sense. @tgravescs What do you think?
[GitHub] spark pull request: [Minor][Mllib] Simplify loss function
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3899#issuecomment-68693757 [Test build #25053 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25053/consoleFull) for PR 3899 at commit [`2bc5712`](https://github.com/apache/spark/commit/2bc5712a6fe374c0821233edab719b58d31ab01b). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [STREAMING] SPARK-4986 Wait for receivers to d...
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/3868#issuecomment-68689798 I am happy to review the code if you take a pass on implementing (2). I can jump in if things get too hairy.
[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3222#issuecomment-68690112 [Test build #25054 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25054/consoleFull) for PR 3222 at commit [`38efd6d`](https://github.com/apache/spark/commit/38efd6d6df6b53ec1f16490b37f3fb1cc0cce053). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4504] Minor bug fixes in bin/run-exampl...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3069#issuecomment-68692219 [Test build #25055 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25055/consoleFull) for PR 3069 at commit [`24905b7`](https://github.com/apache/spark/commit/24905b7793fc1fc8802b370c63caf81671475c31). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4908][SQL] Prevent multiple concurrent ...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/3834#issuecomment-68692301 Just for reference, the root cause behind this issue is discussed in [SPARK-4908] [1]. [1]: https://issues.apache.org/jira/browse/SPARK-4908?focusedCommentId=14264469&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14264469
[GitHub] spark pull request: Update HiveMetastoreCatalog.scala
Github user seayi commented on the pull request: https://github.com/apache/spark/pull/3898#issuecomment-68683132 Override the sameResult method to only compare database name and table name.
[GitHub] spark pull request: Update HiveMetastoreCatalog.scala
GitHub user seayi opened a pull request: https://github.com/apache/spark/pull/3898 Update HiveMetastoreCatalog.scala Modify the sameResult method to only compare database name and table name. Previously, after: cache table t1; select count(*) from t1; would read data from memory, but the query below would not; instead it read from HDFS: select count(*) from t1 t; Cached data is keyed by the logical plan and compared with sameResult, so when the table has an alias, the same table's logical plan is not the same logical plan. So the sameResult method is modified to only compare database name and table name. You can merge this pull request into a Git repository by running: $ git pull https://github.com/seayi/spark branch-1.2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3898.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3898 commit 8d910aa45372c0cf0a3b2a4704b0e1a480bbeb1c Author: seayi 405078...@qq.com Date: 2015-01-05T09:02:51Z Update HiveMetastoreCatalog.scala: modify the sameResult method to only compare database name and table name, because a cached table queried through an alias produced a different logical plan and missed the cache.
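The cache-miss described above can be sketched with a simplified stand-in (these are not Spark's real classes; `Plan` and `Relation` and their fields are invented here for illustration). The cache is keyed by the logical plan and looked up via `sameResult`, so if the alias participates in the comparison, `t1` and `t1 t` never match:

```scala
// Simplified stand-in for a logical plan node (illustrative only).
abstract class Plan {
  def sameResult(other: Plan): Boolean = this == other
}

// The alias is a field of the plan, so default case-class equality treats
// "t1" and "t1 t" as different plans: a cached-table lookup misses.
case class Relation(database: String, table: String, alias: Option[String]) extends Plan {
  // The proposed fix: compare only database name and table name.
  override def sameResult(other: Plan): Boolean = other match {
    case r: Relation => r.database == database && r.table == table
    case _           => false
  }
}

val cached  = Relation("default", "t1", None)       // cache table t1
val aliased = Relation("default", "t1", Some("t"))  // select ... from t1 t

assert(cached != aliased)           // default equality: cache miss
assert(cached.sameResult(aliased))  // alias-insensitive compare: cache hit
assert(!cached.sameResult(Relation("default", "t2", None)))
```

This is the same shape as the pattern-match suggested in the review: matching on the concrete relation type also keeps the comparison safe when the other plan is not a relation at all.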
[GitHub] spark pull request: [STREAMING] SPARK-4986 Wait for receivers to d...
Github user cleaton commented on the pull request: https://github.com/apache/spark/pull/3868#issuecomment-68678753 @tdas Thank you for the input. Yes, the main purpose of this patch is to make ReceiverTracker graceful by waiting for ssc.sparkContext.runJob(tempRDD, ssc.sparkContext.clean(startReceiver)) to terminate and all receivers to de-register (possibly redundant?). I borrowed the approach used in JobGenerator, and you are right, I forgot to keep timeWhenStopStarted global. The second approach sounds good to me. It would make it easier to follow the shutdown sequence if it is consolidated in one place. As for a unit test, my idea is to create a dummy receiver implementation that blocks on shutdown while still producing a fixed number of records. Do you think you or someone else working more closely with Spark Streaming should take over this patch? It seems it is about deciding which approach is best suited for Spark in the long run. I can still try to provide a unit test for this, though.
[GitHub] spark pull request: [SPARK-5088] Use spark-class for running execu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3897#issuecomment-68680860 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25051/ Test PASSed.
[GitHub] spark pull request: [SPARK-5088] Use spark-class for running execu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3897#issuecomment-68680857 [Test build #25051 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25051/consoleFull) for PR 3897 at commit [`ed906b7`](https://github.com/apache/spark/commit/ed906b7ae16c08521f8e3b4aa3da47fd293abff3). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `val executorPath = new File(executorSparkHome, s"/bin/spark-class $executorBackendName")` * `command.setValue(s"cd $`
[GitHub] spark pull request: [SPARK-1600] Refactor FileInputStream tests to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3801#issuecomment-68683294 [Test build #25052 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25052/consoleFull) for PR 3801 at commit [`b4442c3`](https://github.com/apache/spark/commit/b4442c3538ad462a0a7d39f4b2049ed230e92665). * This patch merges cleanly.
[GitHub] spark pull request: Update HiveMetastoreCatalog.scala
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3898#issuecomment-68683258 Can one of the admins verify this patch?
[GitHub] spark pull request: Update HiveMetastoreCatalog.scala
Github user seayi commented on the pull request: https://github.com/apache/spark/pull/3898#issuecomment-68683408 I tested with a Hive table; after modifying the sameResult method it works correctly.
[GitHub] spark pull request: [SPARK-1600] Refactor FileInputStream tests to...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/3801#issuecomment-68684233 Pushed some commits addressing most of the feedback, but I'm still struggling to remove that last `Thread.sleep(1000)`. I think that the problem here is that the writing of the checkpoint is asynchronous and without the sleep, we wind up in a state where batch 3 has started processing but has not finished, and the StreamingContext shuts down before a snapshot including batch 3's file info is written. I plan to dig into this tomorrow to see whether this is actually the case.
[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...
Github user liyezhang556520 commented on the pull request: https://github.com/apache/spark/pull/3825#issuecomment-68684212 @JoshRosen, if we want to use the supervision mechanism, we need to add another actor level as parent of the current Master actor. I don't know whether that is suitable.
[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3222#issuecomment-68696993 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25054/ Test PASSed.
[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3222#issuecomment-68696985 [Test build #25054 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25054/consoleFull) for PR 3222 at commit [`38efd6d`](https://github.com/apache/spark/commit/38efd6d6df6b53ec1f16490b37f3fb1cc0cce053). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class AdaGradUpdater(` * `class DBN(val stackedRBM: StackedRBM)` * `class MLP(` * `class MomentumUpdater(val momentum: Double) extends Updater ` * `class RBM(` * `class StackedRBM(val innerRBMs: Array[RBM])` * `case class MinstItem(label: Int, data: Array[Int]) ` * `class MinstDatasetReader(labelsFile: String, imagesFile: String)`
[GitHub] spark pull request: [SPARK-5052] Add common/base classes to fix gu...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3874#issuecomment-68698508 This one always confuses me, but here's what I think I know: The compiled `Optional` in Spark won't have the correct (meaning, matching the Google Guava `Optional`) signatures on its methods since other Guava classes are shaded. It's there for the Spark API to compile against the Guava class in the package that the user app expects. Apps that use the API methods involving `Optional` will be bundling Guava. Spark uses Guava 14, although in theory you can use any version... that still has the very few methods on `Optional` that Spark actually calls, I suppose, because Spark will be using the user app's copy of `Optional` at runtime. You say you tried it and got `java.lang.NoClassDefFoundError: org/apache/spark/Partition` though. That's weird. `Optional` will be in a different classloader (?) but shouldn't refer to Spark classes. Right? If there's a problem it's somewhere in there, since that's where my understanding of how this works seems not to match your experience. Or else, there's something else subtly not-quite-right about how the user app is run here.
[GitHub] spark pull request: [SPARK-4504] Minor bug fixes in bin/run-exampl...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3069#issuecomment-68698622 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25055/ Test PASSed.
[GitHub] spark pull request: [SPARK-4504] Minor bug fixes in bin/run-exampl...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3069#issuecomment-68698616 [Test build #25055 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25055/consoleFull) for PR 3069 at commit [`24905b7`](https://github.com/apache/spark/commit/24905b7793fc1fc8802b370c63caf81671475c31). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5036][Graphx]Better support sending par...
Github user shijinkui commented on the pull request: https://github.com/apache/spark/pull/3866#issuecomment-68710366 @ankurdave @rxin
[GitHub] spark pull request: [SPARK-5073] spark.storage.memoryMapThreshold ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3900#issuecomment-68711124 [Test build #25056 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25056/consoleFull) for PR 3900 at commit [`fcce2e5`](https://github.com/apache/spark/commit/fcce2e593754fd8bfe389df0c4ddf79cceebdd97). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5073] spark.storage.memoryMapThreshold ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3900#issuecomment-68711139 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25056/ Test PASSed.
[GitHub] spark pull request: [SPARK-4764] Ensure that files are fetched ato...
Github user preaudc commented on the pull request: https://github.com/apache/spark/pull/2855#issuecomment-68702554 Yes, my bad, {{targetDir}} is indeed already a {{File}}. @JoshRosen , how could I fix this, should I create a new pull request, or can this one be reopened?
[GitHub] spark pull request: [SPARK-5073] spark.storage.memoryMapThreshold ...
GitHub user Lewuathe opened a pull request: https://github.com/apache/spark/pull/3900 [SPARK-5073] spark.storage.memoryMapThreshold have two default value Because the page size on most OSes is about 4KB, the default value of spark.storage.memoryMapThreshold is unified to 2 * 4096. You can merge this pull request into a Git repository by running: $ git pull https://github.com/Lewuathe/spark integrate-memoryMapThreshold Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3900.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3900 commit fcce2e593754fd8bfe389df0c4ddf79cceebdd97 Author: lewuathe lewua...@me.com Date: 2015-01-05T12:47:28Z [SPARK-5073] spark.storage.memoryMapThreshold have two default value
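The pattern behind this fix can be sketched in a few lines (illustrative code, not Spark's actual implementation): instead of hard-coding the threshold's default in two places, define the constant once and have every lookup fall back to it.

```scala
// Sketch of unifying a duplicated config default (illustrative names).
object StorageDefaults {
  // ~2 OS pages: most platforms use 4 KB pages.
  val MemoryMapThreshold: Long = 2 * 4096
}

// Every caller resolves the setting through one function, so there is a
// single source of truth for the default value.
def getMemoryMapThreshold(conf: Map[String, String]): Long =
  conf.get("spark.storage.memoryMapThreshold")
    .map(_.toLong)
    .getOrElse(StorageDefaults.MemoryMapThreshold)

val fromDefault = getMemoryMapThreshold(Map.empty)
val fromConf = getMemoryMapThreshold(Map("spark.storage.memoryMapThreshold" -> "4096"))
```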
[GitHub] spark pull request: [SPARK-4631] unit test for MQTT
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/3844#discussion_r22459212 --- Diff: external/mqtt/src/test/scala/org/apache/spark/streaming/mqtt/MQTTStreamSuite.scala ---

```diff
@@ -17,31 +17,111 @@ package org.apache.spark.streaming.mqtt

-import org.scalatest.FunSuite
+import java.net.{URI, ServerSocket}

-import org.apache.spark.streaming.{Seconds, StreamingContext}
+import org.apache.activemq.broker.{TransportConnector, BrokerService}
+import org.apache.spark.util.Utils
+import org.scalatest.{BeforeAndAfter, FunSuite}
+import org.scalatest.concurrent.Eventually
+import scala.concurrent.duration._
+import org.apache.spark.streaming.{Milliseconds, StreamingContext}
 import org.apache.spark.storage.StorageLevel
 import org.apache.spark.streaming.dstream.ReceiverInputDStream
+import org.eclipse.paho.client.mqttv3._
+import org.eclipse.paho.client.mqttv3.persist.MqttDefaultFilePersistence

-class MQTTStreamSuite extends FunSuite {
-
-  val batchDuration = Seconds(1)
+class MQTTStreamSuite extends FunSuite with Eventually with BeforeAndAfter {
+  private val batchDuration = Milliseconds(500)
   private val master: String = "local[2]"
   private val framework: String = this.getClass.getSimpleName
+  private val freePort = findFreePort()
+  private val brokerUri = "//localhost:" + freePort
+  private val topic = "def"
+  private var ssc: StreamingContext = _
+  private val persistenceDir = Utils.createTempDir()
+  private var broker: BrokerService = _
+  private var connector: TransportConnector = _

-  test("mqtt input stream") {
-    val ssc = new StreamingContext(master, framework, batchDuration)
-    val brokerUrl = "abc"
-    val topic = "def"
+  before {
+    ssc = new StreamingContext(master, framework, batchDuration)
+    setupMQTT()
+  }

-    // tests the API, does not actually test data receiving
-    val test1: ReceiverInputDStream[String] = MQTTUtils.createStream(ssc, brokerUrl, topic)
-    val test2: ReceiverInputDStream[String] =
-      MQTTUtils.createStream(ssc, brokerUrl, topic, StorageLevel.MEMORY_AND_DISK_SER_2)
+  after {
+    if (ssc != null) {
+      ssc.stop()
+      ssc = null
+    }
+    Utils.deleteRecursively(persistenceDir)
+    tearDownMQTT()
+  }

-    // TODO: Actually test receiving data
+  test("mqtt input stream") {
+    val sendMessage = "MQTT demo for spark streaming"
+    val receiveStream: ReceiverInputDStream[String] =
+      MQTTUtils.createStream(ssc, "tcp:" + brokerUri, topic, StorageLevel.MEMORY_ONLY)
+    var receiveMessage: List[String] = List()
+    receiveStream.foreachRDD { rdd =>
+      if (rdd.collect.length > 0) {
+        receiveMessage = receiveMessage ::: List(rdd.first)
+        receiveMessage
+      }
+    }
+    ssc.start()
+    publishData(sendMessage)
+    eventually(timeout(10000 milliseconds), interval(100 milliseconds)) {
+      assert(sendMessage.equals(receiveMessage(0)))
+    }
     ssc.stop()
   }
+
+  private def setupMQTT() {
+    broker = new BrokerService()
+    connector = new TransportConnector()
+    connector.setName("mqtt")
+    connector.setUri(new URI("mqtt:" + brokerUri))
+    broker.addConnector(connector)
+    broker.start()
+  }
+
+  private def tearDownMQTT() {
+    if (broker != null) {
+      broker.stop()
+      broker = null
+    }
+    if (connector != null) {
+      connector.stop()
+      connector = null
+    }
+  }
+
+  private def findFreePort(): Int = {
+    Utils.startServiceOnPort(23456, (trialPort: Int) => {
+      val socket = new ServerSocket(trialPort)
+      socket.close()
+      (null, trialPort)
+    })._2
+  }
+
+  def publishData(data: String): Unit = {
+    var client: MqttClient = null
+    try {
+      val persistence: MqttClientPersistence =
+        new MqttDefaultFilePersistence(persistenceDir.getAbsolutePath)
+      client = new MqttClient("tcp:" + brokerUri, MqttClient.generateClientId(), persistence)
+      client.connect()
+      if (client.isConnected) {
+        val msgTopic: MqttTopic = client.getTopic(topic)
+        val message: MqttMessage = new MqttMessage(data.getBytes("utf-8"))
+        message.setQos(1)
+        message.setRetained(true)
+        for (i <- 0 to 100)
+          msgTopic.publish(message)
```

--- End diff --

```
for (...) {
  msgTopic.publish(message)
}
```

Such a code block should either be on one line or within braces.
[GitHub] spark pull request: [SPARK-5073] spark.storage.memoryMapThreshold ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3900#issuecomment-68703864 [Test build #25056 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25056/consoleFull) for PR 3900 at commit [`fcce2e5`](https://github.com/apache/spark/commit/fcce2e593754fd8bfe389df0c4ddf79cceebdd97). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4504][Examples] fix run-example failure...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3377#issuecomment-68706124 There was already a PR for this: https://github.com/apache/spark/pull/3069 But it seems to be fixing a different root cause, namely that the assemblies generated by SBT and Maven are named differently. I am not sure whether this one also addresses that problem?
[GitHub] spark pull request: [SPARK-4504] Minor bug fixes in bin/run-exampl...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3069#issuecomment-68706026 I think this has been superseded by the discussion in https://github.com/apache/spark/pull/3377 ?
[GitHub] spark pull request: [STREAMING] SPARK-4986 Wait for receivers to d...
Github user cleaton commented on the pull request: https://github.com/apache/spark/pull/3868#issuecomment-68707568 OK sounds great. :+1: I can prepare an implementation of (2). Bit busy now, but I think I can have something to review in a week. Any specific unit test you can suggest for me to take a look at? The existing receiver tests? Thanks :)
[GitHub] spark pull request: [SPARK-4644][Core] Implement skewed join
Github user lianhuiwang commented on a diff in the pull request: https://github.com/apache/spark/pull/3505#discussion_r22460587 --- Diff: core/src/main/scala/org/apache/spark/rdd/SkewedJoinRDD.scala ---

```diff
@@ -0,0 +1,345 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.rdd
+
+import java.io.{ObjectOutputStream, IOException}
+
+import scala.reflect.ClassTag
+
+import org.apache.spark._
+import org.apache.spark.serializer.Serializer
+import org.apache.spark.shuffle.ShuffleHandle
+import org.apache.spark.util.Utils
+import org.apache.spark.util.collection._
+
+private[spark] sealed trait JoinType[K, L, R, PAIR <: Product2[_, _]] extends Serializable {
+
+  def flatten(i: Iterator[(K, (Iterable[Chunk[L]], Iterable[Chunk[R]]))]): Iterator[(K, PAIR)]
+
+}
+
+private[spark] object JoinType {
+
+  def inner[K, L, R] = new JoinType[K, L, R, (L, R)] {
+
+    override def flatten(i: Iterator[(K, (Iterable[Chunk[L]], Iterable[Chunk[R]]))]) =
+      i flatMap {
+        case (key, pair) => {
+          if (pair._1.size > pair._2.size) {
+            yieldPair(pair._1, pair._2, key, (p1: L, p2: R) => (p1, p2))
+          } else {
+            yieldPair(pair._2, pair._1, key, (p2: R, p1: L) => (p1, p2))
+          }
+        }
+      }
+  }
+
+  def leftOuter[K, L, R] = new JoinType[K, L, R, (L, Option[R])] {
+
+    override def flatten(i: Iterator[(K, (Iterable[Chunk[L]], Iterable[Chunk[R]]))]) =
+      i flatMap {
+        case (key, pair) => {
+          if (pair._2.size == 0) {
+            for (chunk <- pair._1.iterator;
+                 v <- chunk
+            ) yield (key, (v, None)): (K, (L, Option[R]))
+          }
+          else if (pair._1.size > pair._2.size) {
+            yieldPair(pair._1, pair._2, key, (p1: L, p2: R) => (p1, Some(p2)))
+          } else {
+            yieldPair(pair._2, pair._1, key, (p2: R, p1: L) => (p1, Some(p2)))
+          }
+        }
+      }
+  }
+
+  def rightOuter[K, L, R] = new JoinType[K, L, R, (Option[L], R)] {
+
+    override def flatten(i: Iterator[(K, (Iterable[Chunk[L]], Iterable[Chunk[R]]))]) =
+      i flatMap {
+        case (key, pair) => {
+          if (pair._1.size == 0) {
+            for (chunk <- pair._2.iterator;
+                 v <- chunk
+            ) yield (key, (None, v)): (K, (Option[L], R))
+          }
+          else if (pair._1.size > pair._2.size) {
+            yieldPair(pair._1, pair._2, key, (p1: L, p2: R) => (Some(p1), p2))
+          } else {
+            yieldPair(pair._2, pair._1, key, (p2: R, p1: L) => (Some(p1), p2))
+          }
+        }
+      }
+  }
+
+  def fullOuter[K, L, R] = new JoinType[K, L, R, (Option[L], Option[R])] {
+
+    override def flatten(i: Iterator[(K, (Iterable[Chunk[L]], Iterable[Chunk[R]]))]) =
+      i flatMap {
+        case (key, pair) => {
+          if (pair._1.size == 0) {
+            for (chunk <- pair._2.iterator;
+                 v <- chunk
+            ) yield (key, (None, Some(v))): (K, (Option[L], Option[R]))
+          }
+          else if (pair._2.size == 0) {
+            for (chunk <- pair._1.iterator;
+                 v <- chunk
+            ) yield (key, (Some(v), None)): (K, (Option[L], Option[R]))
+          }
+          else if (pair._1.size > pair._2.size) {
+            yieldPair(pair._1, pair._2, key, (p1: L, p2: R) => (Some(p1), Some(p2)))
+          } else {
+            yieldPair(pair._2, pair._1, key, (p2: R, p1: L) => (Some(p1), Some(p2)))
+          }
+        }
+      }
+  }
+
+  private def yieldPair[K, OUT, IN, PAIR <: Product2[_, _]](
+      outer: Iterable[Chunk[OUT]], inner: Iterable[Chunk[IN]], key: K, toPair: (OUT, IN) => PAIR) =
+    for (
+      outerChunk <- outer.iterator;
+      innerChunk <- inner.iterator;
```
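The per-key flatten in the diff above can be illustrated with a self-contained sketch (plain `Seq` in place of the PR's `Chunk`, illustrative names): for each key, every left value is paired with every right value, and the side that drives the outer loop is chosen by size, as `JoinType.inner` does.

```scala
// Self-contained sketch of an inner-join flatten over pre-grouped values
// (illustrative; the real PR streams Chunk-ed, spillable collections).
def innerJoinFlatten[K, L, R](
    grouped: Iterator[(K, (Seq[L], Seq[R]))]): Iterator[(K, (L, R))] =
  grouped.flatMap { case (key, (lefts, rights)) =>
    // Choose the larger side as the outer loop, mirroring the size check
    // in the diff's JoinType.inner.
    if (lefts.size > rights.size)
      for (l <- lefts.iterator; r <- rights) yield (key, (l, r))
    else
      for (r <- rights.iterator; l <- lefts) yield (key, (l, r))
  }

val joined = innerJoinFlatten(Iterator(
  ("a", (Seq(1, 2), Seq("x"))),
  ("b", (Seq(3), Seq.empty[String]))  // no right values: key dropped, as in an inner join
)).toList
```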
[GitHub] spark pull request: [SPARK-1953][YARN]yarn client mode Application...
Github user WangTaoTheTonic commented on the pull request: https://github.com/apache/spark/pull/3607#issuecomment-68733624 @andrewor14 What to do now? @vanzin @sryza @tgravescs Someone has any better idea?
[GitHub] spark pull request: [SPARK-5006][Deploy]spark.port.maxRetries does...
Github user WangTaoTheTonic commented on the pull request: https://github.com/apache/spark/pull/3841#issuecomment-68734238 @andrewor14 Could you take a look?
[GitHub] spark pull request: [SPARK-5057]Add more details in log when using...
Github user WangTaoTheTonic commented on the pull request: https://github.com/apache/spark/pull/3875#issuecomment-68734038 @JoshRosen Then is it ok?
[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...
Github user avulanov commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-68741897 @dbtsai Just back from vacation too :) I used my old implementation of the matrix form of back-propagation and made sure that it properly uses the matrix stride in Breeze. Also, I optimized rolling the parameters into a vector, combined with an in-place update of the cumulative sum.
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/3638#discussion_r22491893 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---

```diff
@@ -256,15 +256,21 @@ private[spark] class TaskSchedulerImpl(
         val execId = shuffledOffers(i).executorId
         val host = shuffledOffers(i).host
         if (availableCpus(i) >= CPUS_PER_TASK) {
-          for (task <- taskSet.resourceOffer(execId, host, maxLocality)) {
-            tasks(i) += task
-            val tid = task.taskId
-            taskIdToTaskSetId(tid) = taskSet.taskSet.id
-            taskIdToExecutorId(tid) = execId
-            executorsByHost(host) += execId
-            availableCpus(i) -= CPUS_PER_TASK
-            assert(availableCpus(i) >= 0)
-            launchedTask = true
+          try {
+            for (task <- taskSet.resourceOffer(execId, host, maxLocality)) {
+              tasks(i) += task
+              val tid = task.taskId
+              taskIdToTaskSetId(tid) = taskSet.taskSet.id
+              taskIdToExecutorId(tid) = execId
+              executorsByHost(host) += execId
+              availableCpus(i) -= CPUS_PER_TASK
+              assert(availableCpus(i) >= 0)
+              launchedTask = true
+            }
+          } catch {
+            case e: TaskNotSerializableException => {
```

--- End diff --

What about scenarios where you have multiple concurrent jobs (e.g. in an environment like Databricks Cloud, Spark Jobserver, etc)? I agree that the job associated with this task set is doomed, but other jobs should still be able to make progress and those jobs' task sets might still be schedulable.
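The scheduling concern in this review can be sketched without Spark's types (all names below are illustrative): catching a per-task-set serialization failure should abort only that task set, while other concurrently running jobs' task sets keep receiving offers.

```scala
// Illustrative stand-in for Spark's TaskNotSerializableException.
class NotSerializableTaskSet extends RuntimeException

// Offer resources to each task set in turn; a failure in one set aborts
// only that set, and scheduling continues for the others.
def scheduleAll(taskSets: Seq[(String, () => String)]): (List[String], List[String]) = {
  var launched = List.empty[String]
  var aborted = List.empty[String]
  taskSets.foreach { case (name, offer) =>
    try launched ::= offer()
    catch { case _: NotSerializableTaskSet => aborted ::= name } // abort just this set
  }
  (launched.reverse, aborted.reverse)
}

val (launched, aborted) = scheduleAll(Seq(
  ("jobA", () => "taskA"),
  ("jobB", () => throw new NotSerializableTaskSet), // doomed task set
  ("jobC", () => "taskC")                           // still makes progress
))
```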
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2906#issuecomment-68787775 @yu-iskw @rnowling, I asked @freeman-lab to make one pass on this PR. Let's ping him :)
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3638#issuecomment-68787801 [Test build #25064 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25064/consoleFull) for PR 3638 at commit [`b2a430d`](https://github.com/apache/spark/commit/b2a430d9f3fc8dac3c2a20aab6bd07bae8f17691). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4857] [CORE] Adds Executor membership e...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/3711#issuecomment-68787824 This looks like a legitimate test failure. The AMPLab webserver is having some issues today, so here's a different link to reach the same test result: https://hadrian.ist.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25060/testReport/
[GitHub] spark pull request: [SPARK-4857] [CORE] Adds Executor membership e...
Github user ksakellis commented on the pull request: https://github.com/apache/spark/pull/3711#issuecomment-68788875 Yes it is. Not sure why I changed the # of cores between the two commits in the unit test - weird. Anyway, it has been fixed.
[GitHub] spark pull request: [SPARK-4857] [CORE] Adds Executor membership e...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3711#issuecomment-68789356 [Test build #25065 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25065/consoleFull) for PR 3711 at commit [`776d743`](https://github.com/apache/spark/commit/776d743c16cf3506b5c26983836c90066c365ee7). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4857] [CORE] Adds Executor membership e...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3711#discussion_r22495998 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/ExecutorDetails.scala ---
@@ -0,0 +1,29 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.scheduler.cluster
+
+import org.apache.spark.annotation.DeveloperApi
+
+/**
+ * :: DeveloperApi ::
+ * Stores information about an executor to pass from the scheduler to SparkListeners.
+ */
+@DeveloperApi
+class ExecutorDetails(
--- End diff --
Should this be `ExecutorInfo`, consistent with `TaskInfo` and `StageInfo`, etc.?
[GitHub] spark pull request: [SPARK-4857] [CORE] Adds Executor membership e...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3711#issuecomment-68798860 [Test build #25065 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25065/consoleFull) for PR 3711 at commit [`776d743`](https://github.com/apache/spark/commit/776d743c16cf3506b5c26983836c90066c365ee7). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `public class SparkListenerAdapter implements SparkListener ` * `case class SparkListenerExecutorAdded(executorId: String, executorDetails: ExecutorDetails)` * `case class SparkListenerExecutorRemoved(executorId: String, executorDetails: ExecutorDetails)` * `class ExecutorDetails(`
[GitHub] spark pull request: [SPARK-5073] spark.storage.memoryMapThreshold ...
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/3900#issuecomment-68801129 Agree with 2MB with the caveat that this could cause some slowdown for the other code path (reading cache blocks from disk). However, memory mapping frequently can be dangerous (it was actually due to the JVM dying with a SIGBUS error that we added the threshold in the shuffle code path), which seems like a worse problem to have than the possibility of an undetermined slowdown.
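The trade-off described here can be sketched in a few lines (a simplified, self-contained illustration; the threshold constant and method are hypothetical, loosely modeled on the `spark.storage.memoryMapThreshold` idea, not Spark's actual DiskStore code): blocks below the threshold are read into an ordinary heap buffer, and only large blocks are memory-mapped, since mapping many small segments can exhaust address space or crash the JVM.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MapThresholdSketch {
    // Hypothetical analogue of spark.storage.memoryMapThreshold (2 MB here).
    static final long MEMORY_MAP_THRESHOLD = 2 * 1024 * 1024;

    // Small segments: plain buffered read. Large segments: memory-map them,
    // amortizing the mapping cost over a big region.
    static ByteBuffer readBlock(Path file, long offset, long length) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            if (length < MEMORY_MAP_THRESHOLD) {
                ByteBuffer buf = ByteBuffer.allocate((int) length);
                channel.position(offset);
                while (buf.hasRemaining() && channel.read(buf) >= 0) { }
                buf.flip();
                return buf;
            } else {
                return channel.map(MapMode.READ_ONLY, offset, length);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("block", ".bin");
        Files.write(tmp, new byte[]{1, 2, 3, 4});
        ByteBuffer b = readBlock(tmp, 0, 4);  // 4 bytes: below threshold, heap buffer
        System.out.println(b.remaining());
        Files.delete(tmp);
    }
}
```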
[GitHub] spark pull request: [SPARK-1600] Refactor FileInputStream tests to...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/3801#discussion_r22491068 --- Diff: streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala ---
@@ -319,102 +318,141 @@ class CheckpointSuite extends TestSuiteBase {
   // failure, are re-processed or not.
   test("recovery with file input stream") {
     // Set up the streaming context and input streams
+    val batchDuration = Seconds(2)  // Due to 1-second resolution of setLastModified() on some OS's.
     val testDir = Utils.createTempDir()
-    var ssc = new StreamingContext(master, framework, Seconds(1))
-    ssc.checkpoint(checkpointDir)
-    val fileStream = ssc.textFileStream(testDir.toString)
-    // Making value 3 take large time to process, to ensure that the master
-    // shuts down in the middle of processing the 3rd batch
-    val mappedStream = fileStream.map(s => {
-      val i = s.toInt
-      if (i == 3) Thread.sleep(2000)
-      i
-    })
-
-    // Reducing over a large window to ensure that recovery from master failure
-    // requires reprocessing of all the files seen before the failure
-    val reducedStream = mappedStream.reduceByWindow(_ + _, Seconds(30), Seconds(1))
-    val outputBuffer = new ArrayBuffer[Seq[Int]]
-    var outputStream = new TestOutputStream(reducedStream, outputBuffer)
-    outputStream.register()
-    ssc.start()
-
-    // Create files and advance manual clock to process them
-    // var clock = ssc.scheduler.clock.asInstanceOf[ManualClock]
-    Thread.sleep(1000)
-    for (i <- Seq(1, 2, 3)) {
-      Files.write(i + "\n", new File(testDir, i.toString), Charset.forName("UTF-8"))
-      // wait to make sure that the file is written such that it gets shown in the file listings
-      Thread.sleep(1000)
+    val outputBuffer = new ArrayBuffer[Seq[Int]] with SynchronizedBuffer[Seq[Int]]
+
+    def writeFile(i: Int, clock: ManualClock): Unit = {
+      val file = new File(testDir, i.toString)
+      Files.write(i + "\n", file, Charsets.UTF_8)
+      assert(file.setLastModified(clock.currentTime()))
+      // Check that the file's modification date is actually the value we wrote, since rounding or
+      // truncation will break the test:
+      assert(file.lastModified() === clock.currentTime())
     }
-    logInfo("Output = " + outputStream.output.mkString(","))
-    assert(outputStream.output.size > 0, "No files processed before restart")
-    ssc.stop()
-    // Verify whether files created have been recorded correctly or not
-    var fileInputDStream = ssc.graph.getInputStreams().head.asInstanceOf[FileInputDStream[_, _, _]]
-    def recordedFiles = fileInputDStream.batchTimeToSelectedFiles.values.flatten
-    assert(!recordedFiles.filter(_.endsWith("1")).isEmpty)
-    assert(!recordedFiles.filter(_.endsWith("2")).isEmpty)
-    assert(!recordedFiles.filter(_.endsWith("3")).isEmpty)
-
-    // Create files while the master is down
-    for (i <- Seq(4, 5, 6)) {
-      Files.write(i + "\n", new File(testDir, i.toString), Charset.forName("UTF-8"))
-      Thread.sleep(1000)
+    def recordedFiles(ssc: StreamingContext): Seq[Int] = {
+      val fileInputDStream =
+        ssc.graph.getInputStreams().head.asInstanceOf[FileInputDStream[_, _, _]]
+      val filenames = fileInputDStream.batchTimeToSelectedFiles.values.flatten
+      filenames.map(_.split(File.separator).last.toInt).toSeq.sorted
     }
-    // Recover context from checkpoint file and verify whether the files that were
-    // recorded before failure were saved and successfully recovered
-    logInfo("*** RESTARTING ")
-    ssc = new StreamingContext(checkpointDir)
-    fileInputDStream = ssc.graph.getInputStreams().head.asInstanceOf[FileInputDStream[_, _, _]]
-    assert(!recordedFiles.filter(_.endsWith("1")).isEmpty)
-    assert(!recordedFiles.filter(_.endsWith("2")).isEmpty)
-    assert(!recordedFiles.filter(_.endsWith("3")).isEmpty)
+    try {
+      // This is a var because it's re-assigned when we restart from a checkpoint
+      var clock: ManualClock = null
+      withStreamingContext(new StreamingContext(conf, batchDuration)) { ssc =>
+        ssc.checkpoint(checkpointDir)
+        clock = ssc.scheduler.clock.asInstanceOf[ManualClock]
+        val waiter = new StreamingTestWaiter(ssc)
+        val fileStream = ssc.textFileStream(testDir.toString)
+        // Make value 3 take a large time to process, to ensure that the driver
+        // shuts down in the middle of processing the 3rd batch
+        val mappedStream = fileStream.map(s => {
+          val i = s.toInt
+          if (i == 3) Thread.sleep(4000)
+          i
+
[GitHub] spark pull request: [SPARK-5052] Add common/base classes to fix gu...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/3874#issuecomment-68786110 Latest version LGTM. Thanks!
[GitHub] spark pull request: [SPARK-5052] Add common/base classes to fix gu...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3874#issuecomment-68789982 Although further creep of the unshading-of-the-shading feels risky, it seems to resolve a problem, and is in principle OK on the same grounds that unshading `Optional` is. I'm still puzzled about why using Guava 14 isn't working but wouldn't argue with solving a problem this way.
[GitHub] spark pull request: [SPARK-4969][STREAMING][PYTHON] Add binaryReco...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3803#issuecomment-68795630 [Test build #25066 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25066/consoleFull) for PR 3803 at commit [`317b6d1`](https://github.com/apache/spark/commit/317b6d1dc45f0706987c3258beaa64be08df4b3c). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4857] [CORE] Adds Executor membership e...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3711#issuecomment-68796368 I had some minor comments around naming, but overall this looks good.
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3638#issuecomment-68797550 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25064/ Test PASSed.
[GitHub] spark pull request: [SPARK-4737] Task set manager properly handles...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3638#issuecomment-68797542 [Test build #25064 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25064/consoleFull) for PR 3638 at commit [`b2a430d`](https://github.com/apache/spark/commit/b2a430d9f3fc8dac3c2a20aab6bd07bae8f17691). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-4921. TaskSetManager.dequeueTask returns...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3816#issuecomment-68798497 My conclusion (sorry if it was unclear above) was that dequeueTask returning NO_PREF instead of PROCESS_LOCAL should have no effect at all. I still think it's worth changing for clarity, but it's obviously less important. In what cases would dequeueTask return NO_PREF tasks when maxLocality=RACK_LOCAL or ANY? My understanding is that, in a single resourceOffers call, dequeueTask gets called multiple times in order of TaskLocality, so any NO_PREF tasks would be returned when it's called with maxLocality=NO_PREF. And none would remain when ANY and RACK_LOCAL come around.
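The call-ordering argument above can be sketched with a toy model (plain Java; this is an illustrative simplification of the locality mechanism, not Spark's actual TaskSetManager): dequeueTask is invoked once per locality level in increasing order within a single offer round, so any NO_PREF tasks are drained when maxLocality=NO_PREF and none remain by the time RACK_LOCAL or ANY comes around.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class LocalitySketch {
    // Declared in increasing order of "distance", mirroring Spark's TaskLocality.
    enum TaskLocality { PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY }

    // Toy model: each pending task carries its best achievable locality, and
    // dequeueTask(maxLocality) launches every task at or below that level.
    static List<TaskLocality> dequeueTask(Deque<TaskLocality> pending, TaskLocality maxLocality) {
        List<TaskLocality> launched = new ArrayList<>();
        pending.removeIf(t -> {
            if (t.ordinal() <= maxLocality.ordinal()) {
                launched.add(t);
                return true;
            }
            return false;
        });
        return launched;
    }

    public static void main(String[] args) {
        Deque<TaskLocality> pending = new ArrayDeque<>(List.of(
            TaskLocality.NO_PREF, TaskLocality.RACK_LOCAL, TaskLocality.NO_PREF));
        // One resourceOffers round calls dequeueTask for each level in order;
        // NO_PREF tasks come out at the NO_PREF step, leaving none for ANY.
        for (TaskLocality level : TaskLocality.values()) {
            System.out.println(level + " -> " + dequeueTask(pending, level));
        }
    }
}
```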
[GitHub] spark pull request: [SPARK-4857] [CORE] Adds Executor membership e...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3711#issuecomment-68798867 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25065/ Test PASSed.
[GitHub] spark pull request: [SPARK-5040][SQL] Support expressing unresolve...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3862#issuecomment-68801752 LGTM
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user freeman-lab commented on the pull request: https://github.com/apache/spark/pull/2906#issuecomment-68794407 Hey all, thanks for the nudge =) I've been going through it, will get you feedback ASAP.
[GitHub] spark pull request: [SPARK-4857] [CORE] Adds Executor membership e...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3711#discussion_r22495242 --- Diff: core/src/main/java/org/apache/spark/SparkListenerAdapter.java ---
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark;
+
+import org.apache.spark.scheduler.SparkListener;
+import org.apache.spark.scheduler.SparkListenerApplicationEnd;
+import org.apache.spark.scheduler.SparkListenerApplicationStart;
+import org.apache.spark.scheduler.SparkListenerBlockManagerAdded;
+import org.apache.spark.scheduler.SparkListenerBlockManagerRemoved;
+import org.apache.spark.scheduler.SparkListenerEnvironmentUpdate;
+import org.apache.spark.scheduler.SparkListenerExecutorAdded;
+import org.apache.spark.scheduler.SparkListenerExecutorMetricsUpdate;
+import org.apache.spark.scheduler.SparkListenerExecutorRemoved;
+import org.apache.spark.scheduler.SparkListenerJobEnd;
+import org.apache.spark.scheduler.SparkListenerJobStart;
+import org.apache.spark.scheduler.SparkListenerStageCompleted;
+import org.apache.spark.scheduler.SparkListenerStageSubmitted;
+import org.apache.spark.scheduler.SparkListenerTaskEnd;
+import org.apache.spark.scheduler.SparkListenerTaskGettingResult;
+import org.apache.spark.scheduler.SparkListenerTaskStart;
+import org.apache.spark.scheduler.SparkListenerUnpersistRDD;
+
+/**
+ * Java clients should extend this class instead of implementing
+ * SparkListener directly. This is to prevent java clients
+ * from breaking when new events are added to the SparkListener
+ * trait.
+ *
+ * This is a concrete class instead of abstract to enforce
+ * new events get added to both the SparkListener and this adapter
+ * in lockstep.
+ */
+public class SparkListenerAdapter implements SparkListener {
--- End diff --
What about calling this `JavaSparkListener`? That's what we've tended to use in the past for things that were supposed to be drop-in substitutes for Scala classes.
[GitHub] spark pull request: [SPARK-5093] Set spark.network.timeout to 120s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3903#issuecomment-68794632 [Test build #25063 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25063/consoleFull) for PR 3903 at commit [`7c2138e`](https://github.com/apache/spark/commit/7c2138e8c5353490d5037e92e45d2516f34b3170). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `SparkSubmit.printErrorAndExit(s"Cannot load main class from JAR $primaryResource")` * `class BinaryClassificationMetrics(`
[GitHub] spark pull request: [SPARK-4969][STREAMING][PYTHON] Add binaryReco...
Github user freeman-lab commented on a diff in the pull request: https://github.com/apache/spark/pull/3803#discussion_r22495402 --- Diff: streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala ---
@@ -233,6 +236,47 @@ class InputStreamsSuite extends TestSuiteBase with BeforeAndAfter {
   }
 }

+  def testBinaryRecordsStream() {
+    var ssc: StreamingContext = null
+    val testDir: File = null
+    try {
+      val testDir = Utils.createTempDir()
+
+      Thread.sleep(1000)
+      // Set up the streaming context and input streams
+      val newConf = conf.clone.set(
+        "spark.streaming.clock", "org.apache.spark.streaming.util.SystemClock")
--- End diff --
Ok great, I'll wait for your PR to be merged and then refactor this test accordingly.
[GitHub] spark pull request: [SPARK-5093] Set spark.network.timeout to 120s...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3903#issuecomment-68794641 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25063/ Test PASSed.
[GitHub] spark pull request: [SPARK-4286] Integrate external shuffle servic...
Github user tnachen commented on the pull request: https://github.com/apache/spark/pull/3861#issuecomment-68795001 @aarondav
[GitHub] spark pull request: [SPARK-4969][STREAMING][PYTHON] Add binaryReco...
Github user freeman-lab commented on a diff in the pull request: https://github.com/apache/spark/pull/3803#discussion_r22495425 --- Diff: streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala --- @@ -233,6 +236,47 @@ class InputStreamsSuite extends TestSuiteBase with BeforeAndAfter { } } + def testBinaryRecordsStream() { --- End diff -- Good catch, fixed.
[GitHub] spark pull request: [SPARK-4969][STREAMING][PYTHON] Add binaryReco...
Github user freeman-lab commented on a diff in the pull request: https://github.com/apache/spark/pull/3803#discussion_r22496437 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala ---
@@ -373,6 +393,25 @@ class StreamingContext private[streaming] (
   }

+  /**
+   * Create an input stream that monitors a Hadoop-compatible filesystem
+   * for new files and reads them as flat binary files, assuming a fixed length per record,
+   * generating one byte array per record. Files must be written to the monitored directory
+   * by moving them from another location within the same file system. File names
+   * starting with . are ignored.
+   * @param directory HDFS directory to monitor for new file
+   * @param recordLength length of each record in bytes
+   */
+  def binaryRecordsStream(
+      directory: String,
+      recordLength: Int): DStream[Array[Byte]] = {
+    val conf = sc_.hadoopConfiguration
+    conf.setInt(FixedLengthBinaryInputFormat.RECORD_LENGTH_PROPERTY, recordLength)
+    val br = fileStream[LongWritable, BytesWritable, FixedLengthBinaryInputFormat](directory, conf)
+    val data = br.map { case (k, v) => v.getBytes }
--- End diff --
Thanks for flagging this, I wasn't aware of that behavior of ``getBytes``. I think that, as you suggest, both here and in ``binaryRecords()`` it's not a problem in practice: the BytesWritable that comes from the FixedLengthBinaryInputFormat will always be backed by a byte array that's of the fixed length. For consistency and good practice I'm happy to make the change from #2712 both here and in the other method, or we could just add a comment. Let me know which you'd prefer. One concern might be performance effects, as mentioned in the other PR? @sryza might have thoughts.
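The `getBytes` pitfall under discussion is that Hadoop's `BytesWritable` backing array may be longer than the valid data, so the usual fix is to copy only the first `getLength()` bytes. A self-contained sketch of that pattern (the `PaddedBuffer` class below is a hypothetical stand-in for `BytesWritable`, used so the example needs no Hadoop dependency):

```java
import java.util.Arrays;

public class GetBytesSketch {
    // Mimics BytesWritable: a backing array whose valid region is [0, length),
    // possibly followed by stale padding bytes.
    static class PaddedBuffer {
        final byte[] backing;
        final int length;
        PaddedBuffer(byte[] backing, int length) { this.backing = backing; this.length = length; }
        byte[] getBytes() { return backing; }   // may include padding past length
        int getLength() { return length; }
    }

    // The safe conversion: copy only the valid prefix of the backing array.
    static byte[] toArray(PaddedBuffer b) {
        return Arrays.copyOfRange(b.getBytes(), 0, b.getLength());
    }

    public static void main(String[] args) {
        // 3 valid bytes followed by 2 bytes of padding.
        PaddedBuffer b = new PaddedBuffer(new byte[]{10, 20, 30, 0, 0}, 3);
        System.out.println(b.getBytes().length);  // raw array length, includes padding
        System.out.println(toArray(b).length);    // trimmed to the valid region
    }
}
```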
[GitHub] spark pull request: [SPARK-5093] Set spark.network.timeout to 120s...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3903
[GitHub] spark pull request: [SPARK-5040][SQL] Support expressing unresolve...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3862
[GitHub] spark pull request: [SPARK-4804] StringContext method to allow Str...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3649#issuecomment-68802004 Now that #3862 has been merged, can we close this issue?
[GitHub] spark pull request: [SPARK-5052] Add common/base classes to fix gu...
Github user elmer-garduno commented on the pull request: https://github.com/apache/spark/pull/3874#issuecomment-68785782

Thanks, that worked; I updated the PR to reflect those changes. Here is a list of the actual classes that get included in the jar:

jar tf assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop1.2.1.jar | grep com.google.common.base
com/google/common/base/
com/google/common/base/Absent.class
com/google/common/base/Function.class
com/google/common/base/Optional$1$1.class
com/google/common/base/Optional$1.class
com/google/common/base/Optional.class
com/google/common/base/Present.class
com/google/common/base/Supplier.class
[GitHub] spark pull request: Support for Mesos DockerInfo
Github user tnachen commented on the pull request: https://github.com/apache/spark/pull/3074#issuecomment-68791982 @ash211 Can you take a look at this patch again?
[GitHub] spark pull request: [SPARK-2572] Delete the local dir on executor ...
Github user tnachen commented on the pull request: https://github.com/apache/spark/pull/1480#issuecomment-68792088 @watermen Can you update the patch as @andrewor14 mentioned?
[GitHub] spark pull request: [SPARK-4857] [CORE] Adds Executor membership e...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3711#discussion_r22495870

--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala ---
@@ -106,6 +108,7 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val actorSystem
         logDebug(s"Decremented number of pending executors ($numPendingExecutors left)")
       }
     }
+    listenerBus.post(SparkListenerExecutorAdded(executorId, data))

--- End diff --

Don't we need to do the equivalent in `MesosSchedulerBackend`?
[GitHub] spark pull request: [SPARK-4969][STREAMING][PYTHON] Add binaryReco...
Github user freeman-lab commented on the pull request: https://github.com/apache/spark/pull/3803#issuecomment-68797365 Thanks for the review! I'll wait for @JoshRosen 's PR to merge and then update the test here. And will wait for your thoughts on the `getBytes` issue. Otherwise, I think everything's addressed.
[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/3798#discussion_r22498258

--- Diff: external/kafka/src/main/scala/org/apache/spark/rdd/kafka/KafkaRDD.scala ---
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.rdd.kafka
+
+import scala.reflect.{classTag, ClassTag}
+
+import org.apache.spark.{Logging, Partition, SparkContext, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.NextIterator
+
+import java.util.Properties
+import kafka.api.FetchRequestBuilder
+import kafka.common.{ErrorMapping, TopicAndPartition}
+import kafka.consumer.{ConsumerConfig, SimpleConsumer}
+import kafka.message.{MessageAndMetadata, MessageAndOffset}
+import kafka.serializer.Decoder
+import kafka.utils.VerifiableProperties
+
+case class KafkaRDDPartition(
+  override val index: Int,
+  topic: String,
+  partition: Int,
+  fromOffset: Long,
+  untilOffset: Long
+) extends Partition
+
+/** A batch-oriented interface for consuming from Kafka.
+  * Each given Kafka topic/partition corresponds to an RDD partition.
+  * Starting and ending offsets are specified in advance,
+  * so that you can control exactly-once semantics.
+  * For an easy interface to Kafka-managed offsets,
+  * see {@link org.apache.spark.rdd.kafka.KafkaCluster}
+  * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+  * configuration parameters</a>.
+  * Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+  * NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+  * @param fromOffsets per-topic/partition Kafka offsets defining the (inclusive)
+  * starting point of the batch
+  * @param untilOffsets per-topic/partition Kafka offsets defining the (exclusive)
+  * ending point of the batch
+  * @param messageHandler function for translating each message into the desired type
+  */
+class KafkaRDD[
+  K: ClassTag,
+  V: ClassTag,
+  U <: Decoder[_]: ClassTag,
+  T <: Decoder[_]: ClassTag,
+  R: ClassTag](
+    sc: SparkContext,
+    val kafkaParams: Map[String, String],
+    val fromOffsets: Map[TopicAndPartition, Long],
+    val untilOffsets: Map[TopicAndPartition, Long],
+    messageHandler: MessageAndMetadata[K, V] => R
+  ) extends RDD[R](sc, Nil) with Logging {
+
+  assert(fromOffsets.keys == untilOffsets.keys,
+    "Must provide both from and until offsets for each topic/partition")
+
+  override def getPartitions: Array[Partition] = fromOffsets.zipWithIndex.map { kvi =>
+    val ((tp, from), index) = kvi
+    new KafkaRDDPartition(index, tp.topic, tp.partition, from, untilOffsets(tp))
+  }.toArray
+
+  override def compute(thePart: Partition, context: TaskContext) = {
+    val part = thePart.asInstanceOf[KafkaRDDPartition]
+    if (part.fromOffset >= part.untilOffset) {
+      log.warn("Beginning offset is same or after ending offset " +
+        s"skipping ${part.topic} ${part.partition}")
+      Iterator.empty
+    } else {
+      new NextIterator[R] {
+        context.addTaskCompletionListener{ context => closeIfNeeded() }
+
+        val kc = new KafkaCluster(kafkaParams)
+        log.info(s"Computing topic ${part.topic}, partition ${part.partition} " +
+          s"offsets ${part.fromOffset} - ${part.untilOffset}")
+        val keyDecoder = classTag[U].runtimeClass.getConstructor(classOf[VerifiableProperties])
+          .newInstance(kc.config.props)
+          .asInstanceOf[Decoder[K]]
+        val valueDecoder = classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
+          .newInstance(kc.config.props)
+          .asInstanceOf[Decoder[V]]
+        val consumer: SimpleConsumer = kc.connectLeader(part.topic, part.partition).fold(
+          errs => throw new Exception(
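The core idea in the KafkaRDD snippet above is that each RDD partition is pinned to a half-open offset range (inclusive `fromOffset`, exclusive `untilOffset`), so replaying a batch deterministically re-reads exactly the same messages. A standalone sketch of that bookkeeping in plain Scala, independent of Kafka (the type and field names here are illustrative, not the PR's final API):

```scala
// One batch partition: inclusive starting offset, exclusive ending offset.
case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long) {
  require(fromOffset <= untilOffset, "fromOffset must not exceed untilOffset")
  // Exactly this many messages are covered, however often the batch is replayed.
  def count: Long = untilOffset - fromOffset
}
```

Because the ranges are fixed before the batch runs, exactly-once semantics reduce to atomically storing the processed offsets together with the output.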
[GitHub] spark pull request: [SPARK-5050][Mllib] Add unit test for sqdist
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3869#discussion_r22498301

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala ---
@@ -175,6 +177,33 @@ class VectorsSuite extends FunSuite {
     assert(v.size === x.rows)
   }

+  test("sqdist") {
+    val random = new Random(System.nanoTime())
+    for (m <- 1 until 1000 by 10) {
+      val nnz = random.nextInt(m) + 1
+
+      val indices1 = random.shuffle(0 to m - 1).toArray.slice(0, nnz).sorted

--- End diff --

OH interesting...
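For context on what the test above exercises: ``sqdist`` computes the squared Euclidean distance between two vectors. A plain-Scala reference for dense vectors (a sketch for illustration, not MLlib's optimized sparse implementation):

```scala
// Reference squared Euclidean distance for two equal-length dense vectors.
def sqdist(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same length")
  var sum = 0.0
  for (i <- a.indices) {
    val d = a(i) - b(i)
    sum += d * d
  }
  sum
  // e.g. sqdist(Array(1.0, 2.0), Array(4.0, 6.0)) == 25.0
}
```

The MLlib version avoids materializing dense arrays by walking the sorted index arrays of the two sparse vectors, which is what the randomly generated `indices1` in the test is probing.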
[GitHub] spark pull request: SPARK-4660: Use correct default classloader in...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3840#issuecomment-68785091 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25059/consoleFull) for PR 3840 at commit [`86bc5eb`](https://github.com/apache/spark/commit/86bc5ebdfb2c5f0d58ffeaf184f94f60923fe676). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-4660: Use correct default classloader in...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3840#issuecomment-68785101 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25059/ Test FAILed.
[GitHub] spark pull request: [STREAMING] SPARK-3505: Augmenting SparkStream...
Github user xiliu82 commented on the pull request: https://github.com/apache/spark/pull/2267#issuecomment-68785881 I will try to do that this week. On Jan 5, 2015, at 11:50 AM, Tathagata Das notificati...@github.com wrote: Ping, for updating this PR. -- Reply to this email directly or view it on GitHub https://github.com/apache/spark/pull/2267#issuecomment-68765456.
[GitHub] spark pull request: [SPARK-4857] [CORE] Adds Executor membership e...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3711#issuecomment-68787217 [Test build #25060 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25060/consoleFull) for PR 3711 at commit [`6e06a79`](https://github.com/apache/spark/commit/6e06a79c8511ea71182e131ac2a3924aa60f1153). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `public class SparkListenerAdapter implements SparkListener ` * `case class SparkListenerExecutorAdded(executorId: String, executorDetails: ExecutorDetails)` * `case class SparkListenerExecutorRemoved(executorId: String, executorDetails: ExecutorDetails)` * `class ExecutorDetails(`