[GitHub] spark pull request: [SPARK-12330] [MESOS] Fix mesos coarse mode cl...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/10319#discussion_r51470346 --- Diff: docs/running-on-mesos.md --- @@ -387,6 +387,13 @@ See the [configuration page](configuration.html) for information on Spark config + + spark.mesos.coarse.shutdown.ms + 1 (10 seconds) + +Time (in ms) to wait for executors to report that they have exited. Setting this too low has the risk of shutting down the Mesos driver (and thereby killing the spark executors) while the executor is still in the process of exiting cleanly. --- End diff -- documentation needs to be updated accordingly --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12330] [MESOS] Fix mesos coarse mode cl...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/10319#discussion_r51470296 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala --- @@ -60,6 +62,12 @@ private[spark] class CoarseMesosSchedulerBackend( // Maximum number of cores to acquire (TODO: we'll need more flexible controls here) val maxCores = conf.get("spark.cores.max", Int.MaxValue.toString).toInt + private[this] val shutdownTimeoutMS = conf.getInt("spark.mesos.coarse.shutdown.ms", 1) --- End diff -- this needs to be ``` conf.getTimeAsMillis("spark.mesos.coarse.shutdownTimeout", "10s") ``` we have a consistent time string format for accepting similar configs. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10605][SQL] Add structs to collect_list...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11004#issuecomment-178169233 **[Test build #50493 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50493/consoleFull)** for PR 11004 at commit [`326a213`](https://github.com/apache/spark/commit/326a213dc014403aef1033e9d39206f5a873b7a6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12798] [SQL] generated BroadcastHashJoi...
Github user nongli commented on the pull request: https://github.com/apache/spark/pull/10989#issuecomment-178179456 Can you include the generated code? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12463][SPARK-12464][SPARK-12465][SPARK-...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10057 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10820][SQL] Support for the continuous ...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/11006#discussion_r51481347 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamProgress.scala --- @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.streaming + +import scala.collection.mutable + +/** + * A helper class that looks like a Map[Source, Offset]. + */ +class StreamProgress extends Serializable { + private val currentOffsets = new mutable.HashMap[Source, Offset] --- End diff -- I think we should not make `StreamProgress` extend `Serializable` since we cannot recover it using Java serialization. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10820][SQL] Support for the continuous ...
GitHub user marmbrus opened a pull request: https://github.com/apache/spark/pull/11006 [SPARK-10820][SQL] Support for the continuous execution of structured queries This is a follow up to 9aadcffabd226557174f3ff566927f873c71672e that extends Spark SQL to allow users to _repeatedly_ optimize and execute structured queries. A `ContinuousQuery` can be expressed using SQL, DataFrames or Datasets. The purpose of this PR is only to add some initial infrastructure which will be extended in subsequent PRs. ## User-facing API - `sqlContext.streamFrom` and `df.streamTo` return builder objects that are analogous to the `read/write` interfaces already available to executing queries in a batch-oriented fashion. - `ContinuousQuery` provides an interface for interacting with a query that is currently executing in the background. ## Internal Interfaces - `StreamExecution` - executes streaming queries in micro-batches The following are currently internal, but public APIs will be provided in a future release. - `Source` - an interface for providers of continually arriving data. A source must have a notion of an `Offset` that monotonically tracks what data has arrived. For fault tolerance, a source must be able to replay data given a start offset. - `Sink` - an interface that accepts the results of a continuously executing query. Also responsible for tracking the offset that should be resumed from in the case of a failure. ## Testing - `MemoryStream` and `MemorySink` - simple implementations of source and sink that keep all data in memory and have methods for simulating durability failures - `StreamTest` - a framework for performing actions and checking invariants on a continuous query You can merge this pull request into a Git repository by running: $ git pull https://github.com/marmbrus/spark structured-streaming Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11006.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11006 commit e238911f494c2db41e36c9cde9dd0b733630b36f Author: Michael ArmbrustDate: 2015-12-10T02:51:06Z first draft commit d2706b511f254feff7d4558550e403215ffb795f Author: Michael Armbrust Date: 2015-12-11T07:49:33Z working on state commit 7a3590fe5dd659d1a47e645eec247e21280a3373 Author: Michael Armbrust Date: 2015-12-11T19:26:40Z working on stateful streaming commit c8a923831bbfc24f71eb744e36a15e432d6ae067 Author: Michael Armbrust Date: 2015-12-12T00:34:19Z now with event time windows commit 192b7fb9efc59f22009adf61a432b5bafc19 Author: Michael Armbrust Date: 2015-12-13T03:16:48Z some refactoring after talking to ali commit a0a1e7bbd85d0de3f6d90793b894a8b00f86a7f0 Author: Michael Armbrust Date: 2015-12-15T00:47:55Z docs commit 15bed31800319547eb616c0cc22c564bf967b7f1 Author: Michael Armbrust Date: 2015-12-15T18:04:42Z start kinesis commit 89464a91e3954f30b68a1f633e2b9c062e08fa2c Author: Michael Armbrust Date: 2015-12-17T02:18:35Z some renaming commit d133dbbed3d3ff7146129cd6532b1f26f2201315 Author: Michael Armbrust Date: 2015-12-28T00:01:57Z WIP: file source commit e3c4c8301fdcfaaa0bd56ed92a81e4e1d2db64a8 Author: Michael Armbrust Date: 2016-01-05T05:39:40Z Merge remote-tracking branch 'origin/master' into streaming-infra Conflicts: sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala commit 90fa6d30f0bf70a7c0c4bfe0889a07cbdddbf203 Author: Michael Armbrust Date: 2016-01-05T05:52:43Z remove half-baked stateful implementation commit b1c1dc6ead2c76589e48f4c28ef02e6c4361e549 Author: Michael Armbrust Date: 2016-01-05T06:36:46Z cleanup commit 92050688699b4c509f5ec5c2b636797cb5ae3c89 Author: Michael Armbrust Date: 2016-01-05T23:32:34Z more docs commit 7a59f00b4f0d238c33d3250fcfe162d7c2c27c18 Author: Josh Rosen Date: 2016-01-06T00:00:48Z Add circle.yml file. commit eab186d5780b2946a32041ca17d80323b052efc6 Author: Michael Armbrust Date: 2016-01-06T00:48:42Z rollback changes commit 20750630bd50bf7d0abad4a23b2a717705115e46 Author: Josh Rosen Date: 2016-01-06T00:56:16Z Try using cached resolution to speed up compilation commit dabc102271e09bae2ddca1366deb13b20c10ded3 Author: Josh Rosen Date: 2016-01-06T00:59:10Z Use assembly/assembly. commit
[GitHub] spark pull request: [SPARK-12798] [SQL] generated BroadcastHashJoi...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10989#issuecomment-178133344 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7889] [CORE] HistoryServer to refresh c...
Github user squito commented on the pull request: https://github.com/apache/spark/pull/6935#issuecomment-178139026 jenkins, add to whitelist --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12705] [SPARK-10777] [SQL] Analyzer Rul...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10678 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10820][SQL] Support for the continuous ...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/11006#discussion_r51480080 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamProgress.scala --- @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.streaming + +import scala.collection.mutable + +/** + * A helper class that looks like a Map[Source, Offset]. + */ +class StreamProgress extends Serializable { + private val currentOffsets = new mutable.HashMap[Source, Offset] +with mutable.SynchronizedMap[Source, Offset] + + private[streaming] def update(source: Source, newOffset: Offset): Unit = { +currentOffsets.get(source).foreach(old => + assert(newOffset > old, s"Stream going backwards $newOffset -> $old")) +currentOffsets.put(source, newOffset) + } + + private[streaming] def update(newOffset: (Source, Offset)): Unit = +update(newOffset._1, newOffset._2) + + private[streaming] def apply(source: Source): Offset = currentOffsets(source) + private[streaming] def get(source: Source): Option[Offset] = currentOffsets.get(source) + private[streaming] def contains(source: Source): Boolean = currentOffsets.contains(source) + + private[streaming] def ++(updates: Map[Source, Offset]): StreamProgress = { +val updated = new StreamProgress +currentOffsets.foreach(updated.update) +updates.foreach(updated.update) +updated + } + + /** + * Used to create a new copy of this [[StreamProgress]]. While this class is currently mutable, + * it should be copied before being passed to user code. + */ + private[streaming] def copy(): StreamProgress = { +val copied = new StreamProgress +currentOffsets.foreach(copied.update) +copied + } + + override def toString: String = +currentOffsets.map { case (k, v) => s"$k: $v"}.mkString("{", ",", "}") + + override def equals(other: Any): Boolean = other match { +case s: StreamProgress => + s.currentOffsets.keys.toSet == currentOffsets.keys.toSet && --- End diff -- `LongOffset` and `CompositeOffset` don't implement `equals` and `hashCode`, `StreamProgress`'s `equals` and `hashCode` doesn't work. In addition, why not just call `currentOffsets.map(_._1.toString) == s. currentOffsets.map(_._1.toString)` for `equals`, and ``currentOffsets.map(_._1.toString).hashCode` for `hashCode`? `mutable.HashMap` should already do that correctly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12798] [SQL] generated BroadcastHashJoi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10989#issuecomment-178188600 **[Test build #2486 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2486/consoleFull)** for PR 10989 at commit [`c1c0588`](https://github.com/apache/spark/commit/c1c0588053af5aa359b6d03bac6c5d0b198c5b69). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10749][MESOS] Support multiple roles wi...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/8872#discussion_r51481128 --- Diff: core/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterSchedulerSuite.scala --- @@ -0,0 +1,139 @@ +/* --- End diff -- what changed here? Did you move this file or something? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12957][SQL] Initial support for constra...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/10844#discussion_r51482277 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala --- @@ -180,6 +221,46 @@ case class Join( } } + private def constructIsNotNullConstraints(condition: Expression): Set[Expression] = { +// Currently we only propagate constraints if the condition consists of equality +// and ranges. For all other cases, we return an empty set of constraints +splitConjunctivePredicates(condition).map { + case EqualTo(l, r) => +Set(IsNotNull(l), IsNotNull(r)) + case GreaterThan(l, r) => +Set(IsNotNull(l), IsNotNull(r)) + case GreaterThanOrEqual(l, r) => +Set(IsNotNull(l), IsNotNull(r)) + case LessThan(l, r) => +Set(IsNotNull(l), IsNotNull(r)) + case LessThanOrEqual(l, r) => +Set(IsNotNull(l), IsNotNull(r)) + case _ => +Set.empty[Expression] +}.foldLeft(Set.empty[Expression])(_ union _.toSet) + } + + override protected def validConstraints: Set[Expression] = { +joinType match { + case Inner if condition.isDefined => +left.constraints + .union(right.constraints) + .union(constructIsNotNullConstraints(condition.get)) --- End diff -- We should also be including the split form of the condition here and below, right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10820][SQL] Support for the continuous ...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/11006#discussion_r51483011 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StreamTest.scala --- @@ -0,0 +1,346 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import java.lang.Thread.UncaughtExceptionHandler + +import scala.collection.mutable +import scala.collection.mutable.ArrayBuffer +import scala.util.Random + +import org.scalatest.concurrent.Timeouts +import org.scalatest.time.SpanSugar._ + +import org.apache.spark.sql.catalyst.encoders.{encoderFor, ExpressionEncoder, RowEncoder} +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.execution.streaming._ + +/** + * A framework for implementing tests for streaming queries and sources. + * + * A test consists of a set of steps (expressed as a `StreamAction`) that are executed in order, + * blocking as necessary to let the stream catch up. For example, the following adds some data to + * a stream, blocking until it can verify that the correct values are eventually produced. + * + * {{{ + * val inputData = MemoryStream[Int] +val mapped = inputData.toDS().map(_ + 1) + +testStream(mapped)( + AddData(inputData, 1, 2, 3), + CheckAnswer(2, 3, 4)) + * }}} + * + * Note that while we do sleep to allow the other thread to progress without spinning, + * `StreamAction` checks should not depend on the amount of time spent sleeping. Instead they + * should check the actual progress of the stream before verifying the required test condition. + * + * Currently it is assumed that all streaming queries will eventually complete in 10 seconds to + * avoid hanging forever in the case of failures. However, individual suites can change this + * by overriding `streamingTimeout`. + */ +trait StreamTest extends QueryTest with Timeouts { + + implicit class RichSource(s: Source) { +def toDF(): DataFrame = new DataFrame(sqlContext, StreamingRelation(s)) + } + + /** How long to wait for an active stream to catch up when checking a result. */ + val streamingTimout = 10.seconds + + /** A trait for actions that can be performed while testing a streaming DataFrame. */ + trait StreamAction + + /** A trait to mark actions that require the stream to be actively running. */ + trait StreamMustBeRunning + + /** + * Adds the given data to the stream. Subsuquent check answers will block until this data has + * been processed. + */ + object AddData { +def apply[A](source: MemoryStream[A], data: A*): AddDataMemory[A] = + AddDataMemory(source, data) + } + + /** A trait that can be extended when testing other sources. */ + trait AddData extends StreamAction { +def source: Source + +/** + * Called to trigger adding the data. Should return the offset that will denote when this + * new data has been processed. + */ +def addData(): Offset + } + + case class AddDataMemory[A](source: MemoryStream[A], data: Seq[A]) extends AddData { +override def toString: String = s"AddData to $source: ${data.mkString(",")}" + +override def addData(): Offset = { + source.addData(data) +} + } + + /** + * Checks to make sure that the current data stored in the sink matches the `expectedAnswer`. + * This operation automatically blocks untill all added data has been processed. + */ + object CheckAnswer { +def apply[A : Encoder](data: A*): CheckAnswerRows = { + val encoder = encoderFor[A] + val toExternalRow = RowEncoder(encoder.schema) + CheckAnswerRows(data.map(d => toExternalRow.fromRow(encoder.toRow(d +} + +def apply(rows: Row*):
[GitHub] spark pull request: [SPARK-12506][SPARK-12126][SQL]use CatalystSca...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/11005 [SPARK-12506][SPARK-12126][SQL]use CatalystScan for JDBCRelation As suggested https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=15031526page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15031526; class="external-link" rel="nofollow">here, I will change JDBCRelation to implement CatalystScan, and then directly access Catalyst expressions in JDBCRDD. You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-12126 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11005.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11005 commit 0bdfea86fc0671bf636bbed3174baa417de6367e Author: Huaxin GaoDate: 2016-01-31T04:01:12Z [SPARK-12506][SPARK-12126][SQL]use CatalystScan for JDBCRelation --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6847][Core][Streaming]Fix stack overflo...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/10934#issuecomment-178132065 I did not merge this into 1.6 or before because of 2 reasons: - It doesn't merge cleanly, and more importantly - This changes internal semantics and it's not technically a bug Let me know if you disagree. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user iyounus commented on the pull request: https://github.com/apache/spark/pull/10702#issuecomment-178134675 For the case (3), I'm assuming that the label and features are not standardized. So, in that case, the solution exists. Here is my perspective on this. The normal equation `X^T X \beta = X^T y` has a unique solution if matrix `X` is full rank. If its not full rank, the `X^T X` becomes singular and hence not invertible. With regularization, the equation `(X^T X + \lambda I) \beta = X^T y` still has unique solution if the matrix `(X^T X + \lambda I)` is invertible. This is true even if elements of `y` are constant. For the sample data I'm using above, I can solve this equation by hand and obtain `\beta` (with and without intercept). If I'm using any software, it must reproduce these results. Note that standardization is completely independent operation. Its not part of normal equation. So, if the user demands no standardization, then the any software should produce the analytical solution. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12989] [SQL] Delaying Alias Cleanup aft...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/10963#issuecomment-178139377 Thanks, merging to master and 1.6. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10820][SQL] Support for the continuous ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11006#issuecomment-178146931 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50497/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12330] [MESOS] Fix mesos coarse mode cl...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/10319#discussion_r51470974 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala --- @@ -364,7 +379,27 @@ private[spark] class CoarseMesosSchedulerBackend( } override def stop() { -super.stop() +// Make sure we're not launching tasks during shutdown +stateLock.synchronized { + if (stopCalled) { +logWarning("Stop called multiple times, ignoring") +return + } + stopCalled = true + super.stop() +} +// Wait for finish +val stopwatch = new Stopwatch() +stopwatch.start() +// slaveIdsWithExecutors has no memory barrier, so this is eventually consistent +while (slaveIdsWithExecutors.nonEmpty && + stopwatch.elapsed(TimeUnit.MILLISECONDS) < shutdownTimeoutMS) { + Thread.sleep(100) +} +if(slaveIdsWithExecutors.nonEmpty) { --- End diff -- style: ``` if (slaveIds...) { } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12330] [MESOS] Fix mesos coarse mode cl...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/10319#discussion_r51470918 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala --- @@ -364,7 +379,27 @@ private[spark] class CoarseMesosSchedulerBackend( } override def stop() { -super.stop() +// Make sure we're not launching tasks during shutdown +stateLock.synchronized { + if (stopCalled) { +logWarning("Stop called multiple times, ignoring") +return + } + stopCalled = true + super.stop() +} +// Wait for finish --- End diff -- can you expand on this comment? From the code my understanding is that we need to wait until all slaves have properly shutdown before we terminate the Mesos task running the driver. It would be good if this comment could provide more context on why we're doing this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12790][CORE] Remove HistoryServer old m...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/10860#discussion_r51472337 --- Diff: core/src/test/resources/spark-events/local-1425081759268/EVENT_LOG_1 --- @@ -0,0 +1,88 @@ +{"Event":"SparkListenerBlockManagerAdded","Block Manager ID":{"Executor ID":"","Host":"localhost","Port":57967},"Maximum Memory":278302556,"Timestamp":1425081759407} --- End diff -- I might be missing something, why do we still have test files using the old log format? Shouldn't we just replace all old test logs using the new format? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12790][CORE] Remove HistoryServer old m...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/10860#issuecomment-178166829 sure - that was just to make sure it doesn't get picked up, as a negative test. I can purge all of that since it seems both you and @vanzin have the same feedback? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12951] [SQL] support spilling in genera...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10998#issuecomment-178166922 **[Test build #50494 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50494/consoleFull)** for PR 10998 at commit [`02f2a25`](https://github.com/apache/spark/commit/02f2a25346edd718d9371955759478408e77382b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10820][SQL] Support for the continuous ...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/11006#discussion_r51476814 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala --- @@ -0,0 +1,213 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.streaming + +import java.lang.Thread.UncaughtExceptionHandler + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.Logging +import org.apache.spark.sql.{ContinuousQuery, DataFrame, SQLContext} +import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap} +import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan} +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.execution.QueryExecution + +/** + * Manages the execution of a streaming Spark SQL query that is occurring in a separate thread. + * Unlike a standard query, a streaming query executes repeatedly each time new data arrives at any + * [[Source]] present in the query plan. Whenever new data arrives, a [[QueryExecution]] is created + * and the results are committed transactionally to the given [[Sink]]. + */ +class StreamExecution( +sqlContext: SQLContext, +private[sql] val logicalPlan: LogicalPlan, +val sink: Sink) extends ContinuousQuery with Logging { + + /** An monitor used to wait/notify when batches complete. */ + val awaitBatchLock = new Object + + @volatile + var batchRun = false --- End diff -- private --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10820][SQL] Support for the continuous ...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/11006#discussion_r51476822 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala --- @@ -0,0 +1,213 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.streaming + +import java.lang.Thread.UncaughtExceptionHandler + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.Logging +import org.apache.spark.sql.{ContinuousQuery, DataFrame, SQLContext} +import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap} +import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan} +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.execution.QueryExecution + +/** + * Manages the execution of a streaming Spark SQL query that is occurring in a separate thread. + * Unlike a standard query, a streaming query executes repeatedly each time new data arrives at any + * [[Source]] present in the query plan. Whenever new data arrives, a [[QueryExecution]] is created + * and the results are committed transactionally to the given [[Sink]]. + */ +class StreamExecution( +sqlContext: SQLContext, +private[sql] val logicalPlan: LogicalPlan, +val sink: Sink) extends ContinuousQuery with Logging { + + /** An monitor used to wait/notify when batches complete. */ + val awaitBatchLock = new Object + + @volatile + var batchRun = false + + /** Minimum amount of time in between the start of each batch. */ + val minBatchTime = 10 --- End diff -- private --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10820][SQL] Support for the continuous ...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/11006#discussion_r51476807 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala --- @@ -0,0 +1,213 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.streaming + +import java.lang.Thread.UncaughtExceptionHandler + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.Logging +import org.apache.spark.sql.{ContinuousQuery, DataFrame, SQLContext} +import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap} +import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan} +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.execution.QueryExecution + +/** + * Manages the execution of a streaming Spark SQL query that is occurring in a separate thread. + * Unlike a standard query, a streaming query executes repeatedly each time new data arrives at any + * [[Source]] present in the query plan. Whenever new data arrives, a [[QueryExecution]] is created + * and the results are committed transactionally to the given [[Sink]]. + */ +class StreamExecution( +sqlContext: SQLContext, +private[sql] val logicalPlan: LogicalPlan, +val sink: Sink) extends ContinuousQuery with Logging { + + /** An monitor used to wait/notify when batches complete. */ + val awaitBatchLock = new Object --- End diff -- private --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10820][SQL] Support for the continuous ...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/11006#discussion_r51480355 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamProgress.scala --- @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.streaming + +import scala.collection.mutable + +/** + * A helper class that looks like a Map[Source, Offset]. + */ +class StreamProgress extends Serializable { + private val currentOffsets = new mutable.HashMap[Source, Offset] --- End diff -- @transient since Source is not Serializable. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12832][MESOS] mesos scheduler respect a...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10949#issuecomment-178189170 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50500/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12832][MESOS] mesos scheduler respect a...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10949#issuecomment-178189155 **[Test build #50500 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50500/consoleFull)** for PR 10949 at commit [`7c6650d`](https://github.com/apache/spark/commit/7c6650dceed4eaf9e9f50542c76c88fccf7ac4a2). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [test-maven] Shade protobuf-java
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/10995#issuecomment-178197729 Since Hadoop doesn't shade `protobuf`, I think this won't fix the issue in the PR description. Right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12957][SQL] Initial support for constra...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/10844#discussion_r51482050 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala --- @@ -89,9 +89,27 @@ case class Generate( case class Filter(condition: Expression, child: LogicalPlan) extends UnaryNode { override def output: Seq[Attribute] = child.output + + override protected def validConstraints: Set[Expression] = { +val newConstraint = splitConjunctivePredicates(condition) + .filter(_.references.subsetOf(outputSet)) --- End diff -- not needed right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12957][SQL] Initial support for constra...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/10844#discussion_r51482081 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala --- @@ -180,6 +221,46 @@ case class Join( } } + private def constructIsNotNullConstraints(condition: Expression): Set[Expression] = { --- End diff -- Why is this done as a special case here, instead of doing it as part of `getRelevantConstraints`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12265][Mesos] Spark calls System.exit i...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/10921#issuecomment-178204415 LGTM merging into master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12951] [SQL] support spilling in genera...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10998#issuecomment-178128143 **[Test build #50494 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50494/consoleFull)** for PR 10998 at commit [`02f2a25`](https://github.com/apache/spark/commit/02f2a25346edd718d9371955759478408e77382b). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6166] Limit number of concurrent outbou...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/10838#issuecomment-178141964 @zsxwing Other than the PR title - I think everything else says inflight in code/documentation. Did you see anything to the contrary ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10820][SQL] Support for the continuous ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11006#issuecomment-178143411 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50496/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10820][SQL] Support for the continuous ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11006#issuecomment-178143406 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5095] [Mesos] Support launching multipl...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/10993#issuecomment-178151143 Looks like this is failing real tests --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12705] [SPARK-10777] [SQL] Analyzer Rul...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/10678#issuecomment-178161088 LGTM, thanks for adding a bunch of tests! Merging to master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12330] [MESOS] Fix mesos coarse mode cl...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/10319#discussion_r51471308 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala --- @@ -364,7 +379,27 @@ private[spark] class CoarseMesosSchedulerBackend( } override def stop() { -super.stop() +// Make sure we're not launching tasks during shutdown +stateLock.synchronized { + if (stopCalled) { +logWarning("Stop called multiple times, ignoring") +return + } + stopCalled = true + super.stop() +} +// Wait for finish +val stopwatch = new Stopwatch() +stopwatch.start() +// slaveIdsWithExecutors has no memory barrier, so this is eventually consistent +while (slaveIdsWithExecutors.nonEmpty && + stopwatch.elapsed(TimeUnit.MILLISECONDS) < shutdownTimeoutMS) { + Thread.sleep(100) +} +if(slaveIdsWithExecutors.nonEmpty) { + logWarning(s"${slaveIdsWithExecutors.size} executors still running. " ++ "Proceeding with mesos driver stop.") --- End diff -- I don't understand this warning message. I think you mean something more like ``` Timed out on waiting for executors to terminate ($X still running) after $timeout ms. Proceeding to stop Mesos driver, which may lead to leftover temporary files on the slaves. ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11780][SQL] Add catalyst type aliases b...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/10915#issuecomment-178167383 Thanks, merged to branch-1.6. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12951] [SQL] support spilling in genera...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10998#issuecomment-178167285 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50494/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12951] [SQL] support spilling in genera...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10998#issuecomment-178167283 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12951] [SQL] support spilling in genera...
Github user nongli commented on the pull request: https://github.com/apache/spark/pull/10998#issuecomment-178169324 Can you include the generated code? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12790][CORE] Remove HistoryServer old m...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/10860#issuecomment-178169396 I see, that seems unnecessary. We deleted all the code path that would parse the format so it's pretty much impossible for it to be picked up right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12299][CORE][WIP] Remove history servin...
Github user BryanCutler commented on the pull request: https://github.com/apache/spark/pull/10991#issuecomment-178169443 cc @andrewor14, @squito for thoughts on above --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10605][SQL] Add structs to collect_list...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11004#issuecomment-178169641 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10605][SQL] Add structs to collect_list...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11004#issuecomment-178169642 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50493/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [ML][MINOR] Invalid MulticlassClassification r...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/10996#issuecomment-178169426 LGTM. Merged into master. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5095] [Mesos] Support launching multipl...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/10993#issuecomment-178173549 Also, follow-up question: have you tested this with cluster mode and/or with dynamic allocation? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12832][MESOS] mesos scheduler respect a...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/10949#issuecomment-178178205 By the way, I'd just like to point out that there is another patch that fixes the same issue #10768. @tnachen @dragos what's the difference and which one should we proceed with? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12832][CORE] Fix dispatcher does not ha...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/10768#issuecomment-178178342 FYI #10949 is another patch for the same issue. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12265][Mesos] Spark calls System.exit i...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10921 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12992][SQL]: Update parquet reader to s...
Github user nongli commented on a diff in the pull request: https://github.com/apache/spark/pull/10908#discussion_r51462281 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java --- @@ -180,6 +188,152 @@ public void readIntegers(int total, ColumnVector c, int rowId, int level, } } + // TODO: can this code duplication be removed without a perf penalty? + public void readBytes(int total, ColumnVector c, +int rowId, int level, VectorizedValuesReader data) { +int left = total; +while (left > 0) { + if (this.currentCount == 0) this.readNextGroup(); + int n = Math.min(left, this.currentCount); + switch (mode) { +case RLE: + if (currentValue == level) { +data.readBytes(n, c, rowId); +c.putNotNulls(rowId, n); + } else { +c.putNulls(rowId, n); + } + break; +case PACKED: + for (int i = 0; i < n; ++i) { +if (currentBuffer[currentBufferIdx++] == level) { + c.putByte(rowId + i, data.readByte()); + c.putNotNull(rowId + i); +} else { + c.putNull(rowId + i); +} + } + break; + } + rowId += n; + left -= n; + currentCount -= n; +} + } + + public void readLongs(int total, ColumnVector c, int rowId, int level, +VectorizedValuesReader data) { +int left = total; +while (left > 0) { + if (this.currentCount == 0) this.readNextGroup(); + int n = Math.min(left, this.currentCount); + switch (mode) { +case RLE: + if (currentValue == level) { +data.readLongs(n, c, rowId); +c.putNotNulls(rowId, n); + } else { +c.putNulls(rowId, n); + } + break; +case PACKED: + for (int i = 0; i < n; ++i) { +if (currentBuffer[currentBufferIdx++] == level) { + c.putLong(rowId + i, data.readLong()); + c.putNotNull(rowId + i); +} else { + c.putNull(rowId + i); +} + } + break; + } + rowId += n; + left -= n; + currentCount -= n; +} + } + + public void readBinarys(int total, ColumnVector c, int rowId, int level, +VectorizedValuesReader data) { +int left = total; +while (left > 0) { + if (this.currentCount == 0) this.readNextGroup(); + int n = Math.min(left, this.currentCount); + switch (mode) { +case RLE: + if (currentValue == level) { +c.putNotNulls(rowId, n); +data.readBinary(n, c, rowId); + } else { +c.putNulls(rowId, n); + } + break; +case PACKED: + for (int i = 0; i < n; ++i) { +if (currentBuffer[currentBufferIdx++] == level) { + c.putNotNull(rowId + i); + data.readBinary(1, c, rowId); +} else { + c.putNull(rowId + i); +} + } + break; + } + rowId += n; + left -= n; + currentCount -= n; +} + } + + + // This is used for decoding dictionary IDs (as opposed to definition levels). + @Override + public void readIntegers(int total, ColumnVector c, int rowId) { +int left = total; +while (left > 0) { +if (this.currentCount == 0) this.readNextGroup(); + int n = Math.min(left, this.currentCount); + switch (mode) { +case RLE: + c.putInts(rowId, n, currentValue); + break; +case PACKED: + c.putInts(rowId, n, currentBuffer, currentBufferIdx); + currentBufferIdx += n; + break; + } + rowId += n; + left -= n; + currentCount -= n; +} + } + + @Override + public byte readByte() { +throw new UnsupportedOperationException("only readInts is valid."); --- End diff -- This should be readInts. The only valid read* APIs that doesn't also decode definition levels is used to decode dictionary ids, which are always ints. I updated the comment for readIntegers() to try to capture this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but
[GitHub] spark pull request: [SPARK-12992][SQL]: Update parquet reader to s...
Github user nongli commented on a diff in the pull request: https://github.com/apache/spark/pull/10908#discussion_r51462376 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java --- @@ -52,15 +55,49 @@ public void skip(int n) { } @Override - public void readIntegers(int total, ColumnVector c, int rowId) { + public final void readIntegers(int total, ColumnVector c, int rowId) { c.putIntsLittleEndian(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET); offset += 4 * total; } @Override - public int readInteger() { + public final void readLongs(int total, ColumnVector c, int rowId) { +c.putLongsLittleEndian(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET); +offset += 8 * total; + } + + @Override + public final void readBytes(int total, ColumnVector c, int rowId) { +c.putBytes(rowId, total, buffer, offset - Platform.BYTE_ARRAY_OFFSET); +offset += total; + } + + @Override + public final int readInteger() { int v = Platform.getInt(buffer, offset); offset += 4; return v; } + + @Override + public final long readLong() { +long v = Platform.getLong(buffer, offset); +offset += 8; +return v; + } + + @Override + public final byte readByte() { +return (byte)readInteger(); --- End diff -- 4. Parquet only has 2 integers physical types: int32 and int64 and relies on encodings to compress bytes/shorts. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6166] Limit number of concurrent outbou...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/10838#issuecomment-178129985 @tgravescs in general, this patch doesn't limit number of concurrent outbound connections. Instead, it just limits number of in flight blocks. Could you update the title and comments/variable names/configuration in this PR to make it accurate? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10820][SQL] Support for the continuous ...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/11006#issuecomment-178144361 test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12478][SQL] Bugfix: Dataset fields of p...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/10431#issuecomment-178158109 Yeah, since its labeled experimental, I'd like to fix as many Dataset bugs as we can for 1.6.1. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12798] [SQL] generated BroadcastHashJoi...
Github user nongli commented on a diff in the pull request: https://github.com/apache/spark/pull/10989#discussion_r51477107 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/BufferedRowIterator.java --- @@ -54,13 +54,27 @@ public void setInput(Iterator iter) { } /** + * Returns whether `processNext()` should stop processing next row from `input` or not. + */ + protected boolean shouldStop() { --- End diff -- @rxin this seems like it could be used to support limit. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12832][MESOS] mesos scheduler respect a...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10949#issuecomment-178189166 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10820][SQL] Support for the continuous ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11006#issuecomment-178191589 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50498/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10749][MESOS] Support multiple roles wi...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/8872#discussion_r51481006 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala --- @@ -525,14 +531,14 @@ private[spark] class MesosClusterScheduler( d.retryState.get.nextRetry.before(currentTime) } - scheduleTasks( + currentOffers = scheduleTasks( --- End diff -- similarly I don't get the point of assigning this var every time you call `scheduleTasks`. It's the same thing, isn't it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10820][SQL] Support for the continuous ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11006#issuecomment-178191582 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12957][SQL] Initial support for constra...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/10844#discussion_r51482488 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala --- @@ -0,0 +1,138 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.plans + +import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.catalyst.analysis._ +import org.apache.spark.sql.catalyst.dsl.expressions._ +import org.apache.spark.sql.catalyst.dsl.plans._ +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans.logical._ + +class ConstraintPropagationSuite extends SparkFunSuite { + + private def resolveColumn(tr: LocalRelation, columnName: String): Expression = +tr.analyze.resolveQuoted(columnName, caseInsensitiveResolution).get + + private def verifyConstraints(a: Set[Expression], b: Set[Expression]): Unit = { +assert(a.forall(i => b.map(_.semanticEquals(i)).reduce(_ || _))) +assert(b.forall(i => a.map(_.semanticEquals(i)).reduce(_ || _))) --- End diff -- I would make this function manually call `fail` with the condition that we can't find, and also differentiate between `missing` and `found but not expected`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6847][Core][Streaming]Fix stack overflo...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/10934#discussion_r51463360 --- Diff: streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala --- @@ -821,6 +821,75 @@ class CheckpointSuite extends TestSuiteBase with DStreamCheckpointTester checkpointWriter.stop() } + test("SPARK-6847: stack overflow when updateStateByKey is followed by a checkpointed dstream") { +// In this test, there are two updateStateByKey operators. The RDD DAG is as follows: +// +// batch 1batch 2batch 3 ... +// +// 1) input rdd input rdd input rdd +// | | | +// v v v +// 2) cogroup rdd ---> cogroup rdd ---> cogroup rdd ... +// | /| /| +// v/ v/ v +// 3) map rdd ---map rdd ---map rdd ... +// | | | +// v v v +// 4) cogroup rdd ---> cogroup rdd ---> cogroup rdd ... +// | /| /| +// v/ v/ v +// 5) map rdd ---map rdd ---map rdd ... +// +// Every batch depends on its previous batch, so "updateStateByKey" needs to do checkpoint to +// break the RDD chain. However, before SPARK-6847, when the state RDD (layer 5) of the second +// "updateStateByKey" does checkpoint, it won't checkpoint the state RDD (layer 3) of the first +// "updateStateByKey" (Note: "updateStateByKey" has already marked that its state RDD (layer 3) +// should be checkpointed). Hence, the connections between layer 2 and layer 3 won't be broken +// and the RDD chain will grow infinitely and cause StackOverflow. +// +// Therefore SPARK-6847 introduces "spark.checkpoint.checkpointAllMarked" to force checkpointing +// all marked RDDs in the DAG to resolve this issue. (For the previous example, it will break +// connections between layer 2 and layer 3) +ssc = new StreamingContext(master, framework, batchDuration) +val batchCounter = new BatchCounter(ssc) +ssc.checkpoint(checkpointDir) +val inputDStream = new CheckpointInputDStream(ssc) +val updateFunc = (values: Seq[Int], state: Option[Int]) => { + Some(values.sum + state.getOrElse(0)) +} +@volatile var shouldCheckpointAllMarkedRDDs = false +@volatile var rddsCheckpointed = false +inputDStream.map(i => (i, i)) + .updateStateByKey(updateFunc).checkpoint(batchDuration) + .updateStateByKey(updateFunc).checkpoint(batchDuration) + .foreachRDD { rdd => +/** + * Find all RDDs that are marked for checkpointing in the specified RDD and its ancestors. + */ +def findAllMarkedRDDs(rdd: RDD[_]): List[RDD[_]] = { --- End diff -- oh I see... that's unfortunate --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6847][Core][Streaming]Fix stack overflo...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/10934#issuecomment-178130645 Merged into master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12798] [SQL] generated BroadcastHashJoi...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10989#issuecomment-178133351 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50495/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12989] [SQL] Delaying Alias Cleanup aft...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10963 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10820][SQL] Support for the continuous ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11006#issuecomment-178151805 **[Test build #50498 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50498/consoleFull)** for PR 11006 at commit [`7147b3f`](https://github.com/apache/spark/commit/7147b3f8b25e73102a8ba98d81c175acd74f49ca). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [Docs] Fix the jar location of datanucleus in ...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/10901#issuecomment-178163321 Merging to master and 1.6 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12790][CORE] Remove HistoryServer old m...
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/10860#issuecomment-178159771 Could we merge this please? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [Docs] Fix the jar location of datanucleus in ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10901 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11780][SQL] Add catalyst type aliases b...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/10915#issuecomment-178172474 Can you close this PR now? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12832][MESOS] mesos scheduler respect a...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/10949#issuecomment-178177036 Also cc @tnachen who wrote this code originally. From the JIRA: > CoarseMesosSchedulerBackend have constraints feature but dispacher deploy use MesosClusterScheduler, it is different method. Is this caused by duplicate code somewhere? Can we resolve that, either in this patch or separately? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10749][MESOS] Support multiple roles wi...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/8872#discussion_r51480776 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala --- @@ -497,28 +505,28 @@ private[spark] class MesosClusterScheduler( container, image, volumes = volumes, portmaps = portmaps) taskInfo.setContainer(container.build()) } -val queuedTasks = tasks.getOrElseUpdate(offer.offer.getId, new ArrayBuffer[TaskInfo]) +val queuedTasks = tasks.getOrElseUpdate(offer.offerId, new ArrayBuffer[TaskInfo]) queuedTasks += taskInfo.build() -logTrace(s"Using offer ${offer.offer.getId.getValue} to launch driver " + +logTrace(s"Using offer ${offer.offerId.getValue} to launch driver " + submission.submissionId) -val newState = new MesosClusterSubmissionState(submission, taskId, offer.offer.getSlaveId, +val newState = new MesosClusterSubmissionState(submission, taskId, offer.slaveId, None, new Date(), None) launchedDrivers(submission.submissionId) = newState launchedDriversState.persist(submission.submissionId, newState) afterLaunchCallback(submission.submissionId) } } +currentOffers --- End diff -- @tnachen I also don't get it. If the caller called us with `currentOffers`, they already have access to it, so what's the point in returning them here? AFAIK we don't mutate this list in this method --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: java mapwithstate, broken java mapping
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11007#issuecomment-178190454 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12506][SPARK-12126][SQL]use CatalystSca...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11005#issuecomment-178126949 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7889] [CORE] HistoryServer to refresh c...
Github user squito commented on the pull request: https://github.com/apache/spark/pull/6935#issuecomment-178142661 hi @steveloughran sorry for another really long delay on my end. mostly this looks fine, there are some style nits and a couple of comments that need updating. I also looked into the query param thing -- I played with it a bit more and realized its kind of nuisance to test, but I did write a test on `ApplicationCache` for it which I'll send to you. I just had one more concern with the last read-through -- what happens when an app goes from incomplete to complete? I did some manual testing, and things seem to work. But, I have a fear that there is some lingering state that isn't getting cleaned up. I will try to walk through things more carefully but maybe you understand it well enough that you can reassure me (or perhaps you should take another look yourself as well ...) and of course, there are now merge conflicts which need to be fixed, sorry. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10820][SQL] Support for the continuous ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11006#issuecomment-178146926 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6166] Limit number of concurrent outbou...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/10838#issuecomment-178153656 Hm, I think `block` is also not `accurate`. Never mind, just update the title should be enough. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13113][Core] Remove unnecessary bit ope...
Github user holdenk commented on the pull request: https://github.com/apache/spark/pull/11002#issuecomment-178160294 This change looks good to me, maybe @JoshRosen who was the last to touch that line can take a look? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12992][SQL]: Update parquet reader to s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10908#issuecomment-178168455 **[Test build #50499 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50499/consoleFull)** for PR 10908 at commit [`2ea2d54`](https://github.com/apache/spark/commit/2ea2d547c908aa4be289959c461b06fbbc0f9f63). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [ML][MINOR] Invalid MulticlassClassification r...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10996 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12832][MESOS] mesos scheduler respect a...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/10949#issuecomment-178174598 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12992][SQL]: Update parquet reader to s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10908#issuecomment-178179904 **[Test build #50499 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50499/consoleFull)** for PR 10908 at commit [`2ea2d54`](https://github.com/apache/spark/commit/2ea2d547c908aa4be289959c461b06fbbc0f9f63). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10749][MESOS] Support multiple roles wi...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/8872#discussion_r51480535 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala --- @@ -456,34 +460,36 @@ private[spark] class MesosClusterScheduler( private def scheduleTasks( candidates: Seq[MesosDriverDescription], afterLaunchCallback: (String) => Boolean, - currentOffers: List[ResourceOffer], - tasks: mutable.HashMap[OfferID, ArrayBuffer[TaskInfo]]): Unit = { + currentOffers: JList[ResourceOffer], + tasks: mutable.HashMap[OfferID, ArrayBuffer[TaskInfo]]): JList[ResourceOffer] = { --- End diff -- what does this return? You need to document this in the javadoc --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: java mapwithstate, broken java mapping
GitHub user gabrielenizzoli opened a pull request: https://github.com/apache/spark/pull/11007 java mapwithstate, broken java mapping java mapwithstate with Function3 has wrong conversion of java Optional to scala Option, now code uses same conversion used in the mapwithstate call that uses Function4 as an input You can merge this pull request into a Git repository by running: $ git pull https://github.com/gabrielenizzoli/spark branch-1.6 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11007.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11007 commit 78c0323f90d1f31796732229aad84ab1f5f74332 Author: Gabriele NizzoliDate: 2016-02-01T21:03:57Z java mapwithstate broken java mapping java mapwithstate with Function3 has wrong conversion of java Optional to scala Option --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12798] [SQL] generated BroadcastHashJoi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10989#issuecomment-178147379 **[Test build #2486 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2486/consoleFull)** for PR 10989 at commit [`c1c0588`](https://github.com/apache/spark/commit/c1c0588053af5aa359b6d03bac6c5d0b198c5b69). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12790][CORE] Remove HistoryServer old m...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/10860#discussion_r51472679 --- Diff: .rat-excludes --- @@ -63,14 +63,15 @@ logs .*dependency-reduced-pom.xml known_translations json_expectation -local-1422981759269/* -local-1422981780767/* -local-1425081759269/* -local-1426533911241/* -local-1426633911242/* -local-1430917381534/* +local-1422981759269 +local-1422981780767 +local-1425081759269 +local-1426533911241 +local-1426633911242 +local-1430917381534 local-1430917381535_1 local-1430917381535_2 +local-1425081759268/* --- End diff -- what does this mean? Shouldn't we just remove this line and delete the old logs that we're not even using anymore? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10820][SQL] Support for the continuous ...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/11006#discussion_r51476990 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala --- @@ -0,0 +1,213 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.streaming + +import java.lang.Thread.UncaughtExceptionHandler + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.Logging +import org.apache.spark.sql.{ContinuousQuery, DataFrame, SQLContext} +import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap} +import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan} +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.execution.QueryExecution + +/** + * Manages the execution of a streaming Spark SQL query that is occurring in a separate thread. + * Unlike a standard query, a streaming query executes repeatedly each time new data arrives at any + * [[Source]] present in the query plan. Whenever new data arrives, a [[QueryExecution]] is created + * and the results are committed transactionally to the given [[Sink]]. + */ +class StreamExecution( +sqlContext: SQLContext, +private[sql] val logicalPlan: LogicalPlan, +val sink: Sink) extends ContinuousQuery with Logging { + + /** An monitor used to wait/notify when batches complete. */ + val awaitBatchLock = new Object + + @volatile + var batchRun = false + + /** Minimum amount of time in between the start of each batch. */ + val minBatchTime = 10 + + /** Tracks how much data we have processed from each input source. */ + private[sql] val currentOffsets = new StreamProgress + + /** All stream sources present the query plan. */ + private val sources = +logicalPlan.collect { case s: StreamingRelation => s.source } + + // Start the execution at the current offsets stored in the sink. (i.e. avoid reprocessing data + // that we have already processed). + { +sink.currentOffset match { + case Some(c: CompositeOffset) => +val storedProgress = c.offsets +val sources = logicalPlan collect { + case StreamingRelation(source, _) => source +} + +assert(sources.size == storedProgress.size) +sources.zip(storedProgress).foreach { case (source, offset) => + offset.foreach(currentOffsets.update(source, _)) +} + case None => // We are starting this stream for the first time. + case _ => throw new IllegalArgumentException("Expected composite offset from sink") +} + } + + logInfo(s"Stream running at $currentOffsets") + + /** When false, signals to the microBatchThread that it should stop running. */ + @volatile private var shouldRun = true + + // TODO: add exception handling to batch thread + /** The thread that runs the micro-batches of this stream. */ + private[sql] val microBatchThread = new Thread("stream execution thread") { +override def run(): Unit = { + SQLContext.setActive(sqlContext) + while (shouldRun) { +attemptBatch() +Thread.sleep(minBatchTime) // TODO: Could be tighter + } +} + } + microBatchThread.setDaemon(true) + microBatchThread.start() + + @volatile + private[sql] var lastExecution: QueryExecution = null + @volatile + private[sql] var streamDeathCause: Throwable = null + + microBatchThread.setUncaughtExceptionHandler( --- End diff -- Could you move this statement above `microBatchThread.start()`? Otherwise, we may miss some exception happening between `microBatchThread.start()` and `microBatchThread.setUncaughtExceptionHandler` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project
[GitHub] spark pull request: [SPARK-12957][SQL] Initial support for constra...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/10844#discussion_r51478445 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala --- @@ -180,6 +221,46 @@ case class Join( } } + private def constructIsNotNullConstraints(condition: Expression): Set[Expression] = { +// Currently we only propagate constraints if the condition consists of equality +// and ranges. For all other cases, we return an empty set of constraints +splitConjunctivePredicates(condition).map { + case EqualTo(l, r) => +Set(IsNotNull(l), IsNotNull(r)) + case GreaterThan(l, r) => +Set(IsNotNull(l), IsNotNull(r)) + case GreaterThanOrEqual(l, r) => +Set(IsNotNull(l), IsNotNull(r)) + case LessThan(l, r) => +Set(IsNotNull(l), IsNotNull(r)) + case LessThanOrEqual(l, r) => +Set(IsNotNull(l), IsNotNull(r)) + case _ => +Set.empty[Expression] +}.foldLeft(Set.empty[Expression])(_ union _.toSet) + } + + override protected def validConstraints: Set[Expression] = { +joinType match { + case Inner if condition.isDefined => +left.constraints + .union(right.constraints) + .union(constructIsNotNullConstraints(condition.get)) + case LeftSemi if condition.isDefined => +left.constraints + .union(right.constraints) + .union(constructIsNotNullConstraints(condition.get)) + case LeftOuter => +left.constraints + case RightOuter => +right.constraints + case FullOuter => +Set.empty[Expression] +} + } + + def selfJoinResolved: Boolean = left.outputSet.intersect(right.outputSet).isEmpty --- End diff -- I think this was a merging mistake as its duplicated with the method below. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12832][MESOS] mesos scheduler respect a...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10949#issuecomment-178181958 **[Test build #50500 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50500/consoleFull)** for PR 10949 at commit [`7c6650d`](https://github.com/apache/spark/commit/7c6650dceed4eaf9e9f50542c76c88fccf7ac4a2). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10820][SQL] Support for the continuous ...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/11006#discussion_r51478334 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/memory.scala --- @@ -0,0 +1,138 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.streaming + +import java.util.concurrent.atomic.AtomicInteger + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.{Logging, SparkEnv} +import org.apache.spark.sql.{DataFrame, Dataset, Encoder, Row, SQLContext} +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.encoders.{encoderFor, RowEncoder} +import org.apache.spark.sql.types.StructType + +object MemoryStream { + protected val currentBlockId = new AtomicInteger(0) + protected val memoryStreamId = new AtomicInteger(0) + + def apply[A : Encoder](implicit sqlContext: SQLContext): MemoryStream[A] = +new MemoryStream[A](memoryStreamId.getAndIncrement(), sqlContext) +} + +/** + * A [[Source]] that produces value stored in memory as they are added by the user. This [[Source]] + * is primarily intended for use in unit tests as it can only replay data when the object is still + * available. + */ +case class MemoryStream[A : Encoder](id: Int, sqlContext: SQLContext) +extends Source with Logging { + protected val encoder = encoderFor[A] + protected val logicalPlan = StreamingRelation(this) + protected val output = logicalPlan.output + protected val batches = new ArrayBuffer[Dataset[A]] + protected var currentOffset: LongOffset = new LongOffset(-1) + + protected def blockManager = SparkEnv.get.blockManager + + def schema: StructType = encoder.schema + + def getCurrentOffset: Offset = currentOffset + + def toDS()(implicit sqlContext: SQLContext): Dataset[A] = { +new Dataset(sqlContext, logicalPlan) + } + + def toDF()(implicit sqlContext: SQLContext): DataFrame = { +new DataFrame(sqlContext, logicalPlan) + } + + def addData(data: TraversableOnce[A]): Offset = { +import sqlContext.implicits._ +this.synchronized { + currentOffset = currentOffset + 1 + val ds = data.toVector.toDS() + logDebug(s"Adding ds: $ds") + batches.append(ds) + currentOffset +} + } + + override def getNextBatch(start: Option[Offset]): Option[Batch] = synchronized { +val newBlocks = + batches.drop( + start.map(_.asInstanceOf[LongOffset]).getOrElse(LongOffset(-1)).offset.toInt + 1) + +if (newBlocks.nonEmpty) { + logDebug(s"Running [$start, $currentOffset] on blocks ${newBlocks.mkString(", ")}") + val df = newBlocks + .map(_.toDF()) + .reduceOption(_ unionAll _) + .getOrElse(sqlContext.emptyDataFrame) + + Some(new Batch(currentOffset, df)) +} else { + None +} + } + + override def toString: String = s"MemoryStream[${output.mkString(",")}]" +} + +/** + * A sink that stores the results in memory. This [[Sink]] is primarily intended for use in unit + * tests and does not provide durablility. + */ +class MemorySink(schema: StructType) extends Sink with Logging { + /** An order list of batches that have been written to this [[Sink]]. */ + private var batches = new ArrayBuffer[Batch]() + + /** Used to convert an [[InternalRow]] to an external [[Row]] for comparison in testing. */ + private val externalRowConverter = RowEncoder(schema) + + override def currentOffset: Option[Offset] = synchronized { +batches.lastOption.map(_.end) + } + + override def addBatch(nextBatch: Batch): Unit = synchronized { +batches.append(nextBatch) + } + + /** Returns all rows that are stored in this [[Sink]]. */ + def allData: Seq[Row] = synchronized
[GitHub] spark pull request: [SPARK-10820][SQL] Support for the continuous ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11006#issuecomment-178190775 **[Test build #50498 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50498/consoleFull)** for PR 11006 at commit [`7147b3f`](https://github.com/apache/spark/commit/7147b3f8b25e73102a8ba98d81c175acd74f49ca). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10749][MESOS] Support multiple roles wi...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/8872#issuecomment-178195492 @tnachen echoing my comment from another PR: it seems that this feature is already supported in client mode but not in cluster mode. Is there something we can do about this divergence in behavior? Is there some duplicate code to clean up so we don't run into something like this in the future? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10820][SQL] Support for the continuous ...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/11006#discussion_r51483601 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StreamTest.scala --- @@ -0,0 +1,346 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import java.lang.Thread.UncaughtExceptionHandler + +import scala.collection.mutable +import scala.collection.mutable.ArrayBuffer +import scala.util.Random + +import org.scalatest.concurrent.Timeouts +import org.scalatest.time.SpanSugar._ + +import org.apache.spark.sql.catalyst.encoders.{encoderFor, ExpressionEncoder, RowEncoder} +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.execution.streaming._ + +/** + * A framework for implementing tests for streaming queries and sources. + * + * A test consists of a set of steps (expressed as a `StreamAction`) that are executed in order, + * blocking as necessary to let the stream catch up. For example, the following adds some data to + * a stream, blocking until it can verify that the correct values are eventually produced. + * + * {{{ + * val inputData = MemoryStream[Int] +val mapped = inputData.toDS().map(_ + 1) + +testStream(mapped)( + AddData(inputData, 1, 2, 3), + CheckAnswer(2, 3, 4)) + * }}} + * + * Note that while we do sleep to allow the other thread to progress without spinning, + * `StreamAction` checks should not depend on the amount of time spent sleeping. Instead they + * should check the actual progress of the stream before verifying the required test condition. + * + * Currently it is assumed that all streaming queries will eventually complete in 10 seconds to + * avoid hanging forever in the case of failures. However, individual suites can change this + * by overriding `streamingTimeout`. + */ +trait StreamTest extends QueryTest with Timeouts { + + implicit class RichSource(s: Source) { +def toDF(): DataFrame = new DataFrame(sqlContext, StreamingRelation(s)) + } + + /** How long to wait for an active stream to catch up when checking a result. */ + val streamingTimout = 10.seconds + + /** A trait for actions that can be performed while testing a streaming DataFrame. */ + trait StreamAction + + /** A trait to mark actions that require the stream to be actively running. */ + trait StreamMustBeRunning + + /** + * Adds the given data to the stream. Subsuquent check answers will block until this data has + * been processed. + */ + object AddData { +def apply[A](source: MemoryStream[A], data: A*): AddDataMemory[A] = + AddDataMemory(source, data) + } + + /** A trait that can be extended when testing other sources. */ + trait AddData extends StreamAction { +def source: Source + +/** + * Called to trigger adding the data. Should return the offset that will denote when this + * new data has been processed. + */ +def addData(): Offset + } + + case class AddDataMemory[A](source: MemoryStream[A], data: Seq[A]) extends AddData { +override def toString: String = s"AddData to $source: ${data.mkString(",")}" + +override def addData(): Offset = { + source.addData(data) +} + } + + /** + * Checks to make sure that the current data stored in the sink matches the `expectedAnswer`. + * This operation automatically blocks untill all added data has been processed. + */ + object CheckAnswer { +def apply[A : Encoder](data: A*): CheckAnswerRows = { + val encoder = encoderFor[A] + val toExternalRow = RowEncoder(encoder.schema) + CheckAnswerRows(data.map(d => toExternalRow.fromRow(encoder.toRow(d +} + +def apply(rows: Row*):
[GitHub] spark pull request: [test-maven] Shade protobuf-java
Github user tedyu commented on the pull request: https://github.com/apache/spark/pull/10995#issuecomment-178202908 This would still benefit Spark standalone and Spark on Mesos, right ? For Spark on YARN, status quo is maintained. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10820][SQL] Support for the continuous ...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/11006#issuecomment-178205092 Looks great! Just some minor comments. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org