[GitHub] spark issue #19705: [SPARK-22308][test-maven] Support alternative unit testi...
Github user nkronenfeld commented on the issue: https://github.com/apache/spark/pull/19705 ok, now I question my own testing... does maven not run scalastyle tests? Or did I not run the tests properly somehow? I just ran mvn test from root, and it all seemed to work on my machine --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19705: [SPARK-22308][test-maven] Support alternative unit testi...
Github user nkronenfeld commented on the issue: https://github.com/apache/spark/pull/19705 @gatorsmile @srowen I think this is set now.
[GitHub] spark pull request #19705: [SPARK-22308][test-maven] Support alternative uni...
GitHub user nkronenfeld opened a pull request: https://github.com/apache/spark/pull/19705

[SPARK-22308][test-maven] Support alternative unit testing styles in external applications

Continuation of PR #19529 (https://github.com/apache/spark/pull/19529#issuecomment-340252119). The problem with the Maven build in the previous PR was that the new tests created a Spark session outside the tests, which meant there was more than one Spark session around at a time. I was using the Spark session outside the tests so that the tests could share data; I've changed it so that each test creates the data anew.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/nkronenfeld/spark alternative-style-tests-2

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19705.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19705

commit b9d41cd79f05f6c420d070ad07cdfa8f853fd461
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date: 2017-10-15T03:04:16Z
Separate out the portion of SharedSQLContext that requires a FunSuite from the part that works with just any old test suite.
commit 0d4bd97247a2d083c7de55663703b38a34298c9c
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date: 2017-10-15T15:57:09Z
Fix typo in trait name

commit 83c44f1c24619e906af48180d0aace38587aa88d
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date: 2017-10-15T15:57:42Z
Add simple tests for each non-FunSuite test style

commit e460612ec6f36e62d8d21d88c2344378ecba581a
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date: 2017-10-15T16:20:44Z
Document testing possibilities

commit 0ee2aadf29b681b23bed356b14038525574204a5
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date: 2017-10-18T23:46:44Z
Better documentation of testing procedures

commit 802a958b640067b99fda0b2c8587dea5b8000495
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date: 2017-10-18T23:46:58Z
Same initialization issue in SharedSparkContext as is in SharedSparkSession

commit 4218b86d5a8ff2321232ff38ed3e1b217ff7db2a
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date: 2017-10-23T03:49:39Z
Remove documentation of testing

commit 2d927e94f627919ac1546b47072276b23d3e8da2
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date: 2017-10-24T04:37:48Z
Move base versions of PlanTest and SQLTestUtils into the same file as where they came from, in an attempt to make diffs simpler

commit 38a83c081b2f9e28bea6321994fc1a0a0c43f252
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date: 2017-10-25T14:42:15Z
Comment line length should be 100

commit 241459a8a4c554877e381fe8306d086ab5b1b152
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date: 2017-10-25T14:43:51Z
Move SQLTestUtils object to the end of the file

commit 24fc4a324008b2acfcf5a2617eb7cc320565e83c
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date: 2017-10-25T15:00:07Z
fix scalastyle error (whitespace at end of line)

commit e4763d977cffbe7ef362a859c229b74b3cdf4ef3
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date: 2017-10-26T02:27:07Z
Remove extraneous curly brackets around empty PlanTest body

commit 6c0b0d569ae1d779fd9253da0c7e97d12634063c
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date: 2017-10-26T03:24:31Z
Remove extraneous beforeAll and brackets from SharedSQLContext

commit 565c598e89299b8c1473d76249ab732abebdb661
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date: 2017-11-09T06:39:30Z
Make sure no spark sessions are active outside tests
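For context, the usage this PR enables looks roughly like the following — a hypothetical external suite written in FlatSpec style rather than FunSuite, mixing in the SharedSparkSession trait introduced here. The suite name, test data, and the `testImplicits` helper are illustrative assumptions, not taken from the PR itself; per the PR description, each test creates its own data rather than relying on state built outside the tests.

```scala
package com.example.tests  // hypothetical external-application package

import org.scalatest.FlatSpec
import org.apache.spark.sql.test.SharedSparkSession

// A sketch of an external, FlatSpec-style suite sharing one Spark session.
class WordCountSpec extends FlatSpec with SharedSparkSession {

  "count" should "return the number of rows" in {
    import testImplicits._              // assumes the implicits helper from the shared trait
    val df = Seq("a", "b", "c").toDF("letter")  // data created inside the test, not shared
    assert(df.count() === 3)
  }
}
```

Before this change, mixing SharedSQLContext into a FlatSpec (or WordSpec, FunSpec) class would not compile, because the trait pulled in a FunSuite parent.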
[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...
Github user nkronenfeld commented on the issue: https://github.com/apache/spark/pull/19529 hm... I was always testing with sbt, because maven was so slow to do anything. Will do
[GitHub] spark pull request #19529: [SPARK-22308] Support alternative unit testing st...
Github user nkronenfeld commented on a diff in the pull request: https://github.com/apache/spark/pull/19529#discussion_r147038647

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala ---
@@ -17,86 +17,8 @@
 package org.apache.spark.sql.test

-import scala.concurrent.duration._
-
-import org.scalatest.BeforeAndAfterEach
-import org.scalatest.concurrent.Eventually
-
-import org.apache.spark.{DebugFilesystem, SparkConf}
-import org.apache.spark.sql.{SparkSession, SQLContext}
-import org.apache.spark.sql.internal.SQLConf
-
-/**
- * Helper trait for SQL test suites where all tests share a single [[TestSparkSession]].
- */
-trait SharedSQLContext extends SQLTestUtils with BeforeAndAfterEach with Eventually {
-
-  protected def sparkConf = {
-    new SparkConf()
-      .set("spark.hadoop.fs.file.impl", classOf[DebugFilesystem].getName)
-      .set("spark.unsafe.exceptionOnMemoryLeak", "true")
-      .set(SQLConf.CODEGEN_FALLBACK.key, "false")
-  }
-
-  /**
-   * The [[TestSparkSession]] to use for all tests in this suite.
-   *
-   * By default, the underlying [[org.apache.spark.SparkContext]] will be run in local
-   * mode with the default test configurations.
-   */
-  private var _spark: TestSparkSession = null
-
-  /**
-   * The [[TestSparkSession]] to use for all tests in this suite.
-   */
-  protected implicit def spark: SparkSession = _spark
-
-  /**
-   * The [[TestSQLContext]] to use for all tests in this suite.
-   */
-  protected implicit def sqlContext: SQLContext = _spark.sqlContext
-
-  protected def createSparkSession: TestSparkSession = {
-    new TestSparkSession(sparkConf)
-  }
-
-  /**
-   * Initialize the [[TestSparkSession]].
-   */
+trait SharedSQLContext extends SQLTestUtils with SharedSparkSession {

--- End diff --

We could... that would more fit the pattern of what we've done now for PlanTest/PlanTestBase and SQLTestUtils/SQLTestUtilsBase.
I hesitated in this case just because the two are conceptually such different things, and the idea is that both would actually get used - SharedSQLContext in internal tests, SharedSparkSession in external tests.
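The Base-trait split discussed here can be illustrated schematically in plain Scala, with stand-in types instead of the real ScalaTest and Spark classes (all names below other than PlanTest/PlanTestBase are stand-ins): the `…Base` trait carries the shared helpers and only requires *some* suite via a self-type, while the original name keeps its FunSuite parent for existing internal tests.

```scala
// Stand-ins for ScalaTest types (illustration only, no real dependencies).
trait Suite
class FunSuite extends Suite   // stands in for org.scalatest.FunSuite
class FlatSpec extends Suite   // stands in for org.scalatest.FlatSpec

// All shared helpers live in the Base trait; the self-type says
// "mix me into any Suite" without forcing a particular testing style.
trait PlanTestBase { self: Suite =>
  def normalize(plan: String): String = plan.toLowerCase  // toy stand-in helper
}

// The original name keeps extending the FunSuite-style base, so
// existing internal suites are unaffected.
trait PlanTest extends FunSuite with PlanTestBase

// Internal tests keep using PlanTest; external code can pick any style.
class InternalSuite extends FunSuite with PlanTest
class ExternalSpec extends FlatSpec with PlanTestBase
```

The self-type (`self: Suite =>`) is the key trick: it grants the Base trait access to suite members without committing it to a concrete parent class.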
[GitHub] spark pull request #19529: [SPARK-22308] Support alternative unit testing st...
Github user nkronenfeld commented on a diff in the pull request: https://github.com/apache/spark/pull/19529#discussion_r147038487

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala ---
@@ -29,7 +31,14 @@ import org.apache.spark.sql.internal.SQLConf
 /**
  * Provides helper methods for comparing plans.
  */
-trait PlanTest extends SparkFunSuite with PredicateHelper {
+trait PlanTest extends SparkFunSuite with PlanTestBase {
+}

--- End diff --

done
[GitHub] spark pull request #19529: [SPARK-22308] Support alternative unit testing st...
Github user nkronenfeld commented on a diff in the pull request: https://github.com/apache/spark/pull/19529#discussion_r147038504

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala ---
@@ -17,86 +17,8 @@
 package org.apache.spark.sql.test

-import scala.concurrent.duration._
-
-import org.scalatest.BeforeAndAfterEach
-import org.scalatest.concurrent.Eventually
-
-import org.apache.spark.{DebugFilesystem, SparkConf}
-import org.apache.spark.sql.{SparkSession, SQLContext}
-import org.apache.spark.sql.internal.SQLConf
-
-/**
- * Helper trait for SQL test suites where all tests share a single [[TestSparkSession]].
- */
-trait SharedSQLContext extends SQLTestUtils with BeforeAndAfterEach with Eventually {
-
-  protected def sparkConf = {
-    new SparkConf()
-      .set("spark.hadoop.fs.file.impl", classOf[DebugFilesystem].getName)
-      .set("spark.unsafe.exceptionOnMemoryLeak", "true")
-      .set(SQLConf.CODEGEN_FALLBACK.key, "false")
-  }
-
-  /**
-   * The [[TestSparkSession]] to use for all tests in this suite.
-   *
-   * By default, the underlying [[org.apache.spark.SparkContext]] will be run in local
-   * mode with the default test configurations.
-   */
-  private var _spark: TestSparkSession = null
-
-  /**
-   * The [[TestSparkSession]] to use for all tests in this suite.
-   */
-  protected implicit def spark: SparkSession = _spark
-
-  /**
-   * The [[TestSQLContext]] to use for all tests in this suite.
-   */
-  protected implicit def sqlContext: SQLContext = _spark.sqlContext
-
-  protected def createSparkSession: TestSparkSession = {
-    new TestSparkSession(sparkConf)
-  }
-
-  /**
-   * Initialize the [[TestSparkSession]].
-   */
+trait SharedSQLContext extends SQLTestUtils with SharedSparkSession {
   protected override def beforeAll(): Unit = {
-    SparkSession.sqlListener.set(null)
-    if (_spark == null) {
-      _spark = createSparkSession
-    }
-    // Ensure we have initialized the context before calling parent code
     super.beforeAll()

--- End diff --

we don't. Removed.
[GitHub] spark pull request #19529: [SPARK-22308] Support alternative unit testing st...
Github user nkronenfeld commented on a diff in the pull request: https://github.com/apache/spark/pull/19529#discussion_r146887435

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala ---
@@ -17,86 +17,8 @@
 package org.apache.spark.sql.test

-import scala.concurrent.duration._
-
-import org.scalatest.BeforeAndAfterEach
-import org.scalatest.concurrent.Eventually
-
-import org.apache.spark.{DebugFilesystem, SparkConf}
-import org.apache.spark.sql.{SparkSession, SQLContext}
-import org.apache.spark.sql.internal.SQLConf
-
-/**
- * Helper trait for SQL test suites where all tests share a single [[TestSparkSession]].
- */
-trait SharedSQLContext extends SQLTestUtils with BeforeAndAfterEach with Eventually {
-
-  protected def sparkConf = {
-    new SparkConf()
-      .set("spark.hadoop.fs.file.impl", classOf[DebugFilesystem].getName)
-      .set("spark.unsafe.exceptionOnMemoryLeak", "true")
-      .set(SQLConf.CODEGEN_FALLBACK.key, "false")
-  }
-
-  /**
-   * The [[TestSparkSession]] to use for all tests in this suite.
-   *
-   * By default, the underlying [[org.apache.spark.SparkContext]] will be run in local
-   * mode with the default test configurations.
-   */
-  private var _spark: TestSparkSession = null
-
-  /**
-   * The [[TestSparkSession]] to use for all tests in this suite.
-   */
-  protected implicit def spark: SparkSession = _spark
-
-  /**
-   * The [[TestSQLContext]] to use for all tests in this suite.
-   */
-  protected implicit def sqlContext: SQLContext = _spark.sqlContext
-
-  protected def createSparkSession: TestSparkSession = {
-    new TestSparkSession(sparkConf)
-  }
-
-  /**
-   * Initialize the [[TestSparkSession]].
-   */
+trait SharedSQLContext extends SQLTestUtils with SharedSparkSession {
   protected override def beforeAll(): Unit = {

--- End diff --

It's still used throughout the unit tests. I had noticed another issue related to that ([SPARK-15037]), but that is marked resolved without having gotten rid of this.
Note that SharedSQLContext is to SharedSparkSession as PlanTest is to PlanTestBase - it extends FunSuite.
[GitHub] spark pull request #19529: [SPARK-22308] Support alternative unit testing st...
Github user nkronenfeld commented on a diff in the pull request: https://github.com/apache/spark/pull/19529#discussion_r146885975

--- Diff: core/src/test/scala/org/apache/spark/SharedSparkContext.scala ---
@@ -29,10 +29,25 @@ trait SharedSparkContext extends BeforeAndAfterAll with BeforeAndAfterEach { sel
 var conf = new SparkConf(false)

+  /**
+   * Initialize the [[SparkContext]]. Generally, this is just called from

--- End diff --

fixed here, not elsewhere yet
[GitHub] spark pull request #19529: [SPARK-22308] Support alternative unit testing st...
Github user nkronenfeld commented on a diff in the pull request: https://github.com/apache/spark/pull/19529#discussion_r146885926

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala ---
@@ -52,17 +55,142 @@ import org.apache.spark.util.{UninterruptibleThread, Utils}
 * Subclasses should *not* create `SQLContext`s in the test suite constructor, which is
 * prone to leaving multiple overlapping [[org.apache.spark.SparkContext]]s in the same JVM.
 */
-private[sql] trait SQLTestUtils
-  extends SparkFunSuite with Eventually
+private[sql] trait SQLTestUtils extends SparkFunSuite with SQLTestUtilsBase with PlanTest {
+  // Whether to materialize all test data before the first test is run
+  private var loadTestDataBeforeTests = false
+
+  protected override def beforeAll(): Unit = {
+    super.beforeAll()
+    if (loadTestDataBeforeTests) {
+      loadTestData()
+    }
+  }
+
+  /**
+   * Materialize the test data immediately after the `SQLContext` is set up.
+   * This is necessary if the data is accessed by name but not through direct reference.
+   */
+  protected def setupTestData(): Unit = {
+    loadTestDataBeforeTests = true
+  }
+
+  /**
+   * Disable stdout and stderr when running the test. To not output the logs to the console,
+   * ConsoleAppender's `follow` should be set to `true` so that it honors reassignments of
+   * System.out or System.err. Otherwise, ConsoleAppender will still output to the console even if
+   * we change System.out and System.err.
+   */
+  protected def testQuietly(name: String)(f: => Unit): Unit = {
+    test(name) {
+      quietly {
+        f
+      }
+    }
+  }
+
+  /**
+   * Run a test on a separate `UninterruptibleThread`.
+   */
+  protected def testWithUninterruptibleThread(name: String, quietly: Boolean = false)
+      (body: => Unit): Unit = {
+    val timeoutMillis = 1
+    @transient var ex: Throwable = null
+
+    def runOnThread(): Unit = {
+      val thread = new UninterruptibleThread(s"Testing thread for test $name") {
+        override def run(): Unit = {
+          try {
+            body
+          } catch {
+            case NonFatal(e) =>
+              ex = e
+          }
+        }
+      }
+      thread.setDaemon(true)
+      thread.start()
+      thread.join(timeoutMillis)
+      if (thread.isAlive) {
+        thread.interrupt()
+        // If this interrupt does not work, then this thread is most likely running something that
+        // is not interruptible. There is not much point to wait for the thread to terminate, and
+        // we rather let the JVM terminate the thread on exit.
+        fail(
+          s"Test '$name' running on o.a.s.util.UninterruptibleThread timed out after" +
+            s" $timeoutMillis ms")
+      } else if (ex != null) {
+        throw ex
+      }
+    }
+
+    if (quietly) {
+      testQuietly(name) { runOnThread() }
+    } else {
+      test(name) { runOnThread() }
+    }
+  }
+}
+
+private[sql] object SQLTestUtils {

--- End diff --

done
[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...
Github user nkronenfeld commented on the issue: https://github.com/apache/spark/pull/19529

I assume the "unknown error code, -9" is:

[error] running /home/jenkins/workspace/SparkPullRequestBuilder/build/sbt -Phadoop-2.6 -Pflume -Phive-thriftserver -Pyarn -Pkafka-0-8 -Phive -Pkinesis-asl -Pmesos -Dtest.exclude.tags=org.apache.spark.tags.ExtendedHiveTest,org.apache.spark.tags.ExtendedYarnTest test ; process was terminated by signal 9

It started popping up between build 82895 and 82969 - between which all I did was eliminate the documentation on unit testing. I wouldn't think this should affect the build. Signal 9 is a kill signal - which means something external killed the build, I think. Any idea why this is happening for the last several builds?
[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...
Github user nkronenfeld commented on the issue: https://github.com/apache/spark/pull/19529 Yeah, as predicted, that made PlanTest very easy to review, but didn't do as well with SQLTestUtils. I suspect I reordered functions and what-not when I was moving stuff around. If this is still too confusing to deal with, just let me know. Even if I can't make the end diffs of the entire PR trivial, I could certainly re-implement this in a way that the individual commits would each be trivial; then it'd just be a question of following along commit-by-commit, and shouldn't be too bad.
[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...
Github user nkronenfeld commented on the issue: https://github.com/apache/spark/pull/19529 @gatorsmile sounds good, giving that a try now... assuming tests pass, I'll check it in and see if it's any better. I've so far done this for PlanTest and SQLTestUtils. PlanTest, I suspect, it will make much cleaner; in SQLTestUtils I suspect it won't help as much, as it was more a pick-and-choose (this function goes in base, this doesn't). I haven't done it at all for SharedSQLContext/SharedSparkSession... that one seems more deserving of a first-level place to me, so I'm more hesitant to, but if you want, I'll do that one too. I suspect the correct answer is going to be redoing the PR, with careful commits that are clearer about what each does... but I'll try this first anyway :-)
[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...
Github user nkronenfeld commented on the issue: https://github.com/apache/spark/pull/19529 @gatorsmile the code changes aren't huge - there's almost no new code here, it's all just moving code around from one file to another in order to expose a SharedSparkSession with no dependence on FunSuite. I don't *think* that could be broken up into multiple PRs - half-done, it won't compile. As for taking over the code movement, I'm not sure what you mean; please explain further?
[GitHub] spark pull request #19529: [SPARK-22308] Support alternative unit testing st...
Github user nkronenfeld commented on a diff in the pull request: https://github.com/apache/spark/pull/19529#discussion_r146312166

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/GenericFlatSpecSuite.scala ---
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.test
+
+import org.scalatest.FlatSpec
+
+/**
+ * The purpose of this suite is to make sure that generic FlatSpec-based scala
+ * tests work with a shared spark session
+ */
+class GenericFlatSpecSuite extends FlatSpec with SharedSparkSession {

--- End diff --

Yeah, I wasn't totally sure about that either, but I eventually came to the conclusion that they are useful. I think of it more in terms of preventing regressions; the case they will prevent is the FunSuite dependency creeping back into SharedSparkSession. For similar reasons, I should probably put similar tests in for SharedSparkContext.
[GitHub] spark pull request #19529: [SPARK-22308] Support alternative unit testing st...
Github user nkronenfeld commented on a diff in the pull request: https://github.com/apache/spark/pull/19529#discussion_r146311684

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/SharedSparkSession.scala ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.test
+
+import scala.concurrent.duration._
+
+import org.scalatest.{BeforeAndAfterEach, Suite}
+import org.scalatest.concurrent.Eventually
+
+import org.apache.spark.{DebugFilesystem, SparkConf}
+import org.apache.spark.sql.{SparkSession, SQLContext}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * Helper trait for SQL test suites where all tests share a single [[TestSparkSession]].
+ */
+trait SharedSparkSession

--- End diff --

I'm not sure what you're asking. It's specific to the Spark SQL package - in that SparkSession resides there - but not specific to using the language SQL. When I say 'SQL test suites', I'm referring to the former.
[GitHub] spark pull request #19529: [SPARK-22308] Support alternative unit testing st...
Github user nkronenfeld commented on a diff in the pull request: https://github.com/apache/spark/pull/19529#discussion_r146311180

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala ---
@@ -52,249 +36,23 @@ import org.apache.spark.util.{UninterruptibleThread, Utils}
 * Subclasses should *not* create `SQLContext`s in the test suite constructor, which is
 * prone to leaving multiple overlapping [[org.apache.spark.SparkContext]]s in the same JVM.
 */
-private[sql] trait SQLTestUtils
-  extends SparkFunSuite with Eventually
-  with BeforeAndAfterAll
-  with SQLTestData
-  with PlanTest { self =>
-
-  protected def sparkContext = spark.sparkContext
-
+private[sql] trait SQLTestUtils extends SparkFunSuite with SQLTestUtilsBase with PlanTest {

--- End diff --

This has more changes - I left the data initialization stuff in SQLTestUtils, as it wouldn't be needed by external tests. All the <'s below are things that I pulled out of SQLTestUtilsBase and put back into SQLTestUtils. Running similarly:

git checkout 4a779bdac3e75c17b7d36c5a009ba6c948fa9fb6 SQLTestUtils.scala
diff SQLTestUtils.scala SQLTestUtilsBase.scala

I get

27d26
< import scala.util.control.NonFatal
30c29
< import org.scalatest.BeforeAndAfterAll
---
> import org.scalatest.{BeforeAndAfterAll, Suite}
33d31
< import org.apache.spark.SparkFunSuite
38c36
< import org.apache.spark.sql.catalyst.plans.PlanTest
---
> import org.apache.spark.sql.catalyst.plans.PlanTestBase
40d37
< import org.apache.spark.sql.catalyst.util._
43c40
< import org.apache.spark.util.{UninterruptibleThread, Utils}
---
> import org.apache.spark.util.Utils
46c43
< * Helper trait that should be extended by all SQL test suites.
---
> * Helper trait that can be extended by all external SQL test suites.
48,49c45
< * This allows subclasses to plugin a custom `SQLContext`. It comes with test data
< * prepared in advance as well as all implicit conversions used extensively by dataframes.
---
> * This allows subclasses to plugin a custom `SQLContext`.
55,56c51,52
< private[sql] trait SQLTestUtils
< extends SparkFunSuite with Eventually
---
> private[sql] trait SQLTestUtilsBase
> extends Eventually
59c55
< with PlanTest { self =>
---
> with PlanTestBase { self: Suite =>
63,65d58
< // Whether to materialize all test data before the first test is run
< private var loadTestDataBeforeTests = false
<
80,94d72
< /**
< * Materialize the test data immediately after the `SQLContext` is set up.
< * This is necessary if the data is accessed by name but not through direct reference.
< */
< protected def setupTestData(): Unit = {
< loadTestDataBeforeTests = true
< }
<
< protected override def beforeAll(): Unit = {
< super.beforeAll()
< if (loadTestDataBeforeTests) {
< loadTestData()
< }
< }
<
300,354d277
< /**
< * Disable stdout and stderr when running the test. To not output the logs to the console,
< * ConsoleAppender's `follow` should be set to `true` so that it will honors reassignments of
< * System.out or System.err. Otherwise, ConsoleAppender will still output to the console even if
< * we change System.out and System.err.
< */
< protected def testQuietly(name: String)(f: => Unit): Unit = {
< test(name) {
< quietly {
< f
< }
< }
< }
<
< /**
< * Run a test on a separate `UninterruptibleThread`.
< */
< protected def testWithUninterruptibleThread(name: String, quietly: Boolean = false)
< (body: => Unit): Unit = {
< val timeoutMillis = 1
< @transient var ex: Throwable = null
<
< def runOnThread(): Unit = {
< val thread = new UninterruptibleThread(s"Testing thread for test $name") {
< override def run(): Unit = {
< try {
< body
< } catch {
< case NonFatal(e) =>
< ex = e
<
[GitHub] spark pull request #19529: [SPARK-22308] Support alternative unit testing st...
Github user nkronenfeld commented on a diff in the pull request: https://github.com/apache/spark/pull/19529#discussion_r146308234

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala ---
@@ -18,158 +18,9 @@ package org.apache.spark.sql.catalyst.plans

 import org.apache.spark.SparkFunSuite
-import org.apache.spark.sql.AnalysisException
-import org.apache.spark.sql.catalyst.analysis.SimpleAnalyzer
-import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
-import org.apache.spark.sql.catalyst.plans.logical._
-import org.apache.spark.sql.catalyst.util._
-import org.apache.spark.sql.internal.SQLConf

 /**
  * Provides helper methods for comparing plans.
  */
-trait PlanTest extends SparkFunSuite with PredicateHelper {
-
-  // TODO(gatorsmile): remove this from PlanTest and all the analyzer rules
-  protected def conf = SQLConf.get
-
-  /**
-   * Since attribute references are given globally unique ids during analysis,
-   * we must normalize them to check if two different queries are identical.
-   */
-  protected def normalizeExprIds(plan: LogicalPlan) = {
-    plan transformAllExpressions {
-      case s: ScalarSubquery =>
-        s.copy(exprId = ExprId(0))
-      case e: Exists =>
-        e.copy(exprId = ExprId(0))
-      case l: ListQuery =>
-        l.copy(exprId = ExprId(0))
-      case a: AttributeReference =>
-        AttributeReference(a.name, a.dataType, a.nullable)(exprId = ExprId(0))
-      case a: Alias =>
-        Alias(a.child, a.name)(exprId = ExprId(0))
-      case ae: AggregateExpression =>
-        ae.copy(resultId = ExprId(0))
-    }
-  }
-
-  /**
-   * Normalizes plans:
-   * - Filter the filter conditions that appear in a plan. For instance,
-   *   ((expr 1 && expr 2) && expr 3), (expr 1 && expr 2 && expr 3), (expr 3 && (expr 1 && expr 2)
-   *   etc., will all now be equivalent.
-   * - Sample the seed will replaced by 0L.
-   * - Join conditions will be resorted by hashCode.
-   */
-  protected def normalizePlan(plan: LogicalPlan): LogicalPlan = {
-    plan transform {
-      case Filter(condition: Expression, child: LogicalPlan) =>
-        Filter(splitConjunctivePredicates(condition).map(rewriteEqual).sortBy(_.hashCode())
-          .reduce(And), child)
-      case sample: Sample =>
-        sample.copy(seed = 0L)
-      case Join(left, right, joinType, condition) if condition.isDefined =>
-        val newCondition =
-          splitConjunctivePredicates(condition.get).map(rewriteEqual).sortBy(_.hashCode())
-            .reduce(And)
-        Join(left, right, joinType, Some(newCondition))
-    }
-  }
-
-  /**
-   * Rewrite [[EqualTo]] and [[EqualNullSafe]] operator to keep order. The following cases will be
-   * equivalent:
-   * 1. (a = b), (b = a);
-   * 2. (a <=> b), (b <=> a).
-   */
-  private def rewriteEqual(condition: Expression): Expression = condition match {
-    case eq @ EqualTo(l: Expression, r: Expression) =>
-      Seq(l, r).sortBy(_.hashCode()).reduce(EqualTo)
-    case eq @ EqualNullSafe(l: Expression, r: Expression) =>
-      Seq(l, r).sortBy(_.hashCode()).reduce(EqualNullSafe)
-    case _ => condition // Don't reorder.
-  }
-
-  /** Fails the test if the two plans do not match */
-  protected def comparePlans(
-      plan1: LogicalPlan,
-      plan2: LogicalPlan,
-      checkAnalysis: Boolean = true): Unit = {
-    if (checkAnalysis) {
-      // Make sure both plan pass checkAnalysis.
-      SimpleAnalyzer.checkAnalysis(plan1)
-      SimpleAnalyzer.checkAnalysis(plan2)
-    }
-
-    val normalized1 = normalizePlan(normalizeExprIds(plan1))
-    val normalized2 = normalizePlan(normalizeExprIds(plan2))
-    if (normalized1 != normalized2) {
-      fail(
-        s"""
-          |== FAIL: Plans do not match ===
-          |${sideBySide(normalized1.treeString, normalized2.treeString).mkString("\n")}
-        """.stripMargin)
-    }
-  }
-
-  /** Fails the test if the two expressions do not match */
-  protected def compareExpressions(e1: Expression, e2: Expression): Unit = {
-    comparePlans(Filter(e1, OneRowRelation()), Filter(e2, OneRowRelation()), checkAnalysis = false)
-  }
-
-  /** Fails the test if the join order in the two plans do not match */
-  protected def compareJoinOrder(plan1: LogicalPlan, plan2: LogicalPlan) {
-    val normalized1 = normalizePlan(normal
[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...
Github user nkronenfeld commented on the issue: https://github.com/apache/spark/pull/19529 Documentation removed as per @srowen 's request in the associated JIRA issue [SPARK-22308]
[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...
Github user nkronenfeld commented on the issue: https://github.com/apache/spark/pull/19529 nope, using lazy val initialization won't work - at the very least, UnsafeKryoSerializerSuite modifies conf before context construction
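The hazard with lazy-val initialization can be sketched without Spark at all. The `Conf`/`Context` classes and suite names below are stand-ins (illustrative only, not Spark's API): a lazy context freezes the configuration at first access, so a suite that mutates its conf after anything has touched the context silently loses those settings.

```scala
import scala.collection.mutable

// Stand-ins for SparkConf and SparkContext (illustration only).
class Conf { val settings = mutable.Map.empty[String, String] }
class Context(conf: Conf) {
  // The context copies its configuration at construction time.
  val snapshot: Map[String, String] = conf.settings.toMap
}

trait SharedContextLike {
  val conf = new Conf
  lazy val ctx = new Context(conf)  // built whenever it is first touched
}

// Mutates conf in its constructor, before anything touches ctx: works.
class LateTouch extends SharedContextLike {
  conf.settings("spark.kryo.unsafe") = "true"
}

// If any code path touches ctx first, the later mutation is lost.
class EarlyTouch extends SharedContextLike {
  ctx                                          // context built here...
  conf.settings("spark.kryo.unsafe") = "true"  // ...so this never reaches it
}
```

With explicit initialization in beforeAll, the ordering is controlled by the framework; with a bare lazy val, it depends on whoever happens to dereference the context first.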
[GitHub] spark issue #19529: Support alternative unit testing styles in external appl...
Github user nkronenfeld commented on the issue: https://github.com/apache/spark/pull/19529 There is one small hack in the way this was done, which is documented - see the comments and documentation on SharedSparkSession.initializeSession and SharedSparkContext.initializeContext. I would rather just have the session and context be lazy transient vals, which would work fine without this initialize call, but I didn't want to change the way tests currently run without input.
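The trade-off described here can be sketched roughly as follows. This is a hypothetical simplification, not the merged SharedSparkSession code; everything except the `initializeSession` name is illustrative. Instead of a lazy transient val, the session is built in a named method invoked from `beforeAll`, so subclasses can adjust the conf before construction:

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, Suite}

// Sketch of a style-agnostic shared-session trait: it mixes into any
// scalatest Suite (FunSpec, WordSpec, ...), not just FunSuite.
trait SharedSessionSketch extends BeforeAndAfterAll { self: Suite =>
  @transient private var _spark: SparkSession = _
  protected def spark: SparkSession = _spark

  // The "small hack": session creation is a named method called from
  // beforeAll rather than a lazy val, so a subclass can tweak its conf
  // (as UnsafeKryoSerializerSuite does) before the session exists.
  protected def initializeSession(): Unit = {
    if (_spark == null) {
      _spark = SparkSession.builder().master("local[2]").getOrCreate()
    }
  }

  override def beforeAll(): Unit = {
    super.beforeAll()
    initializeSession()
  }

  override def afterAll(): Unit = {
    try {
      if (_spark != null) { _spark.stop(); _spark = null }
    } finally super.afterAll()
  }
}
```

A lazy val would remove the explicit call, at the cost of changing when (and under which conf) the session is created.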
[GitHub] spark issue #19529: Support alternative unit testing styles in external appl...
Github user nkronenfeld commented on the issue: https://github.com/apache/spark/pull/19529 I made my original changes here by using

    git mv PlanTest.scala PlanTestBase.scala
    git mv SQLTestUnit.scala SQLTestUnitBase.scala
    git mv SharedSQLContext.scala SharedSparkSession.scala

and then created new PlanTest, SQLTestUnit, and SharedSQLContext files, under the assumption that most of the code would go in the base, and that this would help git provide better history and continuity. I'm not sure that was the right decision - probably it is with PlanTest and SQLTestUnit, possibly not with SharedSQLContext - but since the diff in the PR doesn't reflect the git mv properly, I'm not sure it will make a difference either way. If reviewers wish me to redo this PR without the initial `git mv`, I'll be happy to.
[GitHub] spark pull request #19529: Support alternative unit testing styles in extern...
GitHub user nkronenfeld opened a pull request: https://github.com/apache/spark/pull/19529 Support alternative unit testing styles in external applications

## What changes were proposed in this pull request?

Support unit tests of external code (i.e., applications that use Spark) written in scalatest styles other than FunSuite. SharedSparkContext already supports this, but SharedSQLContext does not. I've introduced SharedSparkSession as a parent of SharedSQLContext, written so that it supports all scalatest styles.

## How was this patch tested?

Three new unit test suites were added that test using FunSpec, FlatSpec, and WordSpec.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nkronenfeld/spark alternative-style-tests-2

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19529.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19529

commit b9d41cd79f05f6c420d070ad07cdfa8f853fd461
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date: 2017-10-15T03:04:16Z

    Separate out the portion of SharedSQLContext that requires a FunSuite from the part that works with just any old test suite.
commit 0d4bd97247a2d083c7de55663703b38a34298c9c
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date: 2017-10-15T15:57:09Z

    Fix typo in trait name

commit 83c44f1c24619e906af48180d0aace38587aa88d
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date: 2017-10-15T15:57:42Z

    Add simple tests for each non-FunSuite test style

commit e460612ec6f36e62d8d21d88c2344378ecba581a
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date: 2017-10-15T16:20:44Z

    Document testing possibilities

commit 0ee2aadf29b681b23bed356b14038525574204a5
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date: 2017-10-18T23:46:44Z

    Better documentation of testing procedures

commit 802a958b640067b99fda0b2c8587dea5b8000495
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date: 2017-10-18T23:46:58Z

    Same initialization issue in SharedSparkContext as is in SharedSparkSession
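For illustration, a WordSpec-style suite of the sort this PR enables might look like the following. This is a made-up example with a Spark-free body so that only the style is shown; in the real suites one would mix in the shared-session trait as well:

```scala
import org.scalatest.WordSpec

// Illustrative only: a WordSpec-style suite of the kind the change allows
// for SQL tests. The assertion body is deliberately plain Scala.
class StyleDemoSpec extends WordSpec {
  "a word-count step" should {
    "sum counts per key" in {
      val counts = Seq(("a", 1), ("a", 2), ("b", 3))
        .groupBy(_._1).map { case (k, v) => k -> v.map(_._2).sum }
      assert(counts == Map("a" -> 3, "b" -> 3))
    }
  }
}
```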
[GitHub] spark pull request: [WIP] Common interfaces between RDD, DStream, ...
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/5565#issuecomment-96656969 I'll see, I guess I can always reopen. If I can get some work towards this done, some deprecation with new versions of the necessary functions, I'll resubmit. Any idea how to deal with the ClassTag differences? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
[GitHub] spark pull request: [WIP] Common interfaces between RDD, DStream, ...
Github user nkronenfeld closed the pull request at: https://github.com/apache/spark/pull/5565
[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/5565#issuecomment-95221601 Could you leave it open a little bit longer? I want to see if I can shorten the list of differences, and having it active here makes it a little easier to see if I've done it correctly.
[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/5565#issuecomment-94472545 @srowen - the examples are pretty simple - it happens as soon as you are passing an RDD to something, rather than passing something to an RDD. For instance: case class PipelineData[T](config: Properties, data: RDD[T]). In each individual case, there are ways around it, of course, but in total they get to be rather ugly rather quickly.
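The ClassTag friction alluded to can be sketched without Spark at all. Below, `DistLike` is a hypothetical stand-in (not a Spark class) for a collection whose transform requires a ClassTag, as the DStream pair methods did; any generic wrapper in the spirit of the PipelineData example must then thread a ClassTag through its own signature:

```scala
import scala.reflect.ClassTag

// Hypothetical stand-in for an RDD-like collection whose transformation
// requires a ClassTag, as several DStream pair methods do.
class DistLike[T](val data: Seq[T]) {
  def mapLike[U: ClassTag](f: T => U): DistLike[U] = new DistLike(data.map(f))
}

// A generic pipeline wrapper: every generic helper that transforms the
// wrapped collection must now demand its own ClassTag just to satisfy
// mapLike, and so on up the call chain.
case class PipelineData[T](config: Map[String, String], data: DistLike[T]) {
  def transform[U: ClassTag](f: T => U): PipelineData[U] =
    PipelineData(config, data.mapLike(f))
}

val stage = PipelineData(Map.empty[String, String], new DistLike(Seq("a", "bb", "ccc")))
val lengths = stage.transform(_.length)
```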
[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/5565#issuecomment-94471584 @pwendell - something like reduceByKey having the opposite argument order of everything else in the code seems like an obvious mistake in the API - is there no mechanism to allow for fixing such things?
[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/5565#issuecomment-94519103 @srowen - If I understand your question, you're asking why we can't take our application, written for an RDD, and just put the whole thing inside a call to dstream.foreachRDD... I'm not ignoring it, I'm trying to figure out what would happen if we did that, and that might take a bit. Regarding your answer to @harishreedharan, I would say, of course we don't have to - but we really should have from the start. To have two nearly, but not quite, identical APIs is confusing, and encourages inconsistency; having a common interface enforces that the methods in both mean the same thing now, change together in the future, and are consistent.
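The workaround under debate - writing the core logic once against RDDs and lifting it into streaming per micro-batch - can be sketched as follows (illustrative helper names; `wordCounts`, `runBatch`, and `runStreaming` are not part of any Spark API):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Core logic written once, against RDDs only.
def wordCounts(lines: RDD[String]): RDD[(String, Long)] =
  lines.flatMap(_.split("\\s+")).map((_, 1L)).reduceByKey(_ + _)

// Batch entry point: apply it directly.
def runBatch(lines: RDD[String]): Unit =
  wordCounts(lines).saveAsTextFile("counts")

// Streaming entry point: the same function, applied to each micro-batch.
def runStreaming(lines: DStream[String]): Unit =
  lines.transform(wordCounts _).saveAsTextFiles("counts")
```

This works per-function, but as the comment above notes, a surrounding application that passes the collections between modules still ends up duplicated for the two cases.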
[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/5565#issuecomment-94507802 @pwendell I get all that... does that mean we are stuck with those deprecated versions forever then? Or is there some timeframe over which deprecated functions will eventually be removed? Would it be possible to prepare for this, say, 4 releases in advance, so eventually it could be done?
[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/5565#issuecomment-94510832 @srowen are you suggesting putting the entire application inside the foreachRDD call?
[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/5565#issuecomment-94554558 @srowen I'm sure, for what it's worth, that part works out - it's whether the surrounding application structure can fit in that paradigm that is the issue. Just to be clear, I didn't really expect this to ever get merged in as is :-) I am hoping for a few things from this PR:

1. I wanted a list of the points at which the various APIs needed to be unified
2. I wanted to know what sort of compatibility was needed (code compatibility, which, with one exception, this keeps, vs. binary compatibility, which it definitely doesn't)
3. I was hoping to foster in the community some sense that this was a goal we could work towards
   * so that when we got to the next point where we could make compatibility changes, we would be ready to do so
   * and to try to prevent further changes from moving the APIs even farther apart.
4. And, lastly, I was hoping that someone whose Scala is better than mine might have some ideas on how to do this without as many API changes (for instance, modifying this so that we could have all 4 versions of reduceByKey, with the bad one deprecated, and still have the compiler infer types correctly)
[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/5565#issuecomment-94330557 Thanks for taking a look at it. I'll take a look and write up here exactly what API changes this makes - while I don't think there are many, there aren't none, and that's why I was looking for comments before going any further, once I figured out whether it was possible. As far as I could tell, I could not get this to compile without making these changes, for reasons I'll comment on individually when I put together a list. As far as the importance of unifying RDD and DStream - I would put that far above unifying RDD and DataFrame. I only included DataFrame because someone had already made it inherit from a class they called RDDApi, which looked to me like a bit of prep work for doing what I've done. RDD and DStream, while different things, are used for the same purpose. One of the chief benefits of Spark, touted from early on, was that one could use the same code in varying circumstances - batch jobs, live jobs, streaming jobs, etc. Yet, from the beginning of the streaming side, despite this being touted as a benefit, it wasn't really true. I think making it really true is a huge up-side. I know that, for my company, the ability to take our batch jobs and apply them to streaming data without changing our code would be huge, and I can't imagine this isn't true for many other people.
[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/5565#issuecomment-94332668 Ok, I think there are only 3 real API changes, though one of them appears in multiple places.

* `RDD`
  * `flatMap[U](f: T => TraversableOnce[U])` changed to `flatMap[U](f: T => Traversable[U])`. My thinking at the time was that `DStream` used a function to `Traversable`, `RDD` one to `TraversableOnce`, and that the `Traversable` part in `DStream` was necessary there. Thinking about it again, I'm not completely sure of that - it might be just as valid to change the `DStream` version to `TraversableOnce`, which would probably be more general and require fewer people to change code.
* `PairRDDFunctions`
  * Several of the functions now require a `ClassTag`, as the `DStream` versions already did. I tried to take the requirement out of the `DStream` versions instead, since that would be more inclusive, but ran into some compiler problems, and this was easier in the short run. If people think it's doable to change in that direction, I can try.
    * `combineByKey`
    * `mapValues`
    * `flatMapValues`
  * `reduceByKey`
    * changed the (partitioner, function) version to (function, partitioner) (see below). I wanted to leave the old version in, deprecated, but for the same reason as the next point, couldn't.
    * eliminated the (function, int) version. Unfortunately, the compiler seemed to want to pick the function's type parameters before it picked which overload to use, and because of this couldn't handle the type inference. I'm not sure someone with more Scala fu couldn't solve this. I'll copy the error in here when I can get hold of it again.

However, regarding the last two: there is a general pattern in a lot of the Spark functions, in that they all have three versions:

* `foo()` - an N-argument version that uses the default partitioner
* `foo(..., Int)` - an N+1-argument version that uses a HashPartitioner, and
* `foo(..., Partitioner)` - an N+1-argument version that uses the passed-in partitioner.

`PairRDDFunctions.reduceByKey` broke this pattern, but `PairDStreamFunctions.reduceByKey` did not - so, with differing method signatures, one had to change. Since the RDD one was out of sync with pretty much every similar function in the class, it seemed natural to change it - and, I would argue, it only makes sense that way. As my initial PR notes mention, I think the ideal thing would be to reduce all these function triplets to a single version, with auto-conversion from `Int` to `HashPartitioner`, which would leave all associated classes cleaner, smaller, and less confusing. Besides these three changes, I don't see any other API changes, but I'll go through it again later to be sure.
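The proposed triplet-collapse can be sketched in plain Scala. None of this is the actual Spark API: `Partitioner` and `HashPartitioner` here are toy stand-ins, and the partitioner argument is accepted but unused since the toy runs locally:

```scala
import scala.language.implicitConversions

// Toy stand-ins for the real classes, to keep the sketch self-contained.
trait Partitioner { def numPartitions: Int }
case class HashPartitioner(numPartitions: Int) extends Partitioner

object PartitionerDefaults {
  val default: Partitioner = HashPartitioner(8)
  // The auto-conversion: lets call sites that pass an Int keep compiling.
  implicit def intToPartitioner(n: Int): Partitioner = HashPartitioner(n)
}

class PairOps[K, V](data: Seq[(K, V)]) {
  // One method standing in for all three overloads of the triplet.
  def reduceByKey(f: (V, V) => V,
                  p: Partitioner = PartitionerDefaults.default): Map[K, V] =
    data.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(f) }
}

import PartitionerDefaults._
val ops = new PairOps(Seq(("a", 1), ("a", 2), ("b", 3)))
val viaDefault = ops.reduceByKey(_ + _)                      // foo()
val viaInt     = ops.reduceByKey(_ + _, 4)                   // foo(..., Int)
val viaPart    = ops.reduceByKey(_ + _, HashPartitioner(2))  // foo(..., Partitioner)
```

All three call shapes resolve to the single method, with the Int version handled by the implicit conversion.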
[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/5565#issuecomment-94351472 No, we do that at the moment. But doing it that way results in a few rather ugly constructs in the application code that can be rather painful as soon as one starts passing data constructs around. As soon as one starts passing the collection structures between modules - say, between stages in a pipeline - one instantly needs to duplicate the entire pipeline for the batch and streaming cases. It isn't just one place where one has to do this replacement - it's every little pipeline operation, for every algorithm, 90% of which use just the most basic RDD and DStream functions and should be easy to consolidate. I'd also note that, where there is an interface change, it is there because the original methods in RDD and DStream were declared inconsistently. Unless there is a good reason to keep them inconsistent (which so far I don't see in any of these three cases), I would suggest that isn't a good thing to begin with - just in terms of consistency and usability of the library, where they can be the same, they should be. It reduces the learning curve, and removes some esoteric, hard-to-track-down gotchas that are bound to occasionally bite people newly switching from one case to the other. On a final note, if this is the intended use of DStream, why have the map, flatMap, reduceByKey, etc. functions on it at all? It seems clear it was intended to be used this way (hm, that reminds me of a fourth small interface change I'll add above, but as you'll see, it's very, very minor), so why not make sure the use is the same?
[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...
GitHub user nkronenfeld opened a pull request: https://github.com/apache/spark/pull/5565 Common interfaces between RDD, DStream, and DataFrame

This PR is the beginning of an attempt to put a common interface between the main distributed collection types in Spark. I've tried to make this first check-in towards such a common interface as simple as possible. To this end, I've taken RDDApi from the sql project, pulled it up into core (as RDDLike), and changed it as necessary to allow all three main distributed collection classes to implement it. I've then done something similar for the pair methods, between RDD and DStream (I don't think there is an equivalent for DataFrames). This involves a few small interface changes - things like reduceByKey having different method signatures in different classes - but they are, for the moment, minor. That being said, they are still interface changes, and I don't expect this to get merged in without discussion. So - suggestions and help are welcome, encouraged, etc. In the very near future, if this PR is accepted, I would like to expand on it in a few simple ways:

* I want to try to pull more functions up into this interface
* There are a lot of functions with 3 versions:
  * foo(...)
  * foo(..., numPartitions: Int)
  * foo(..., partitioner: Partitioner)

  These should all be replaceable by

  * foo(..., partitioner: Partitioner = defaultPartitioner)

  with the implicit Int => Partitioner conversion I've put in here. I did half of this reduction in one case (reduceByKey), out of necessity, trying to get the implementation contained herein to compile, but extending it as far as possible would make a lot of things much cleaner.
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nkronenfeld/spark-1 feature/common-interface2

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5565.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #5565

commit 9dbbd9ea0e69fd0d5fc5056aeabe4f7efc842cee
Author: Nathan Kronenfeld nkronenf...@oculusinfo.com
Date: 2015-04-17T20:23:26Z

    Common interface between RDD, DStream, DataFrame - non-pair methods

commit fb920ffc6e30897e19626f6556af3f0ffc5248bb
Author: Nathan Kronenfeld nkronenf...@oculusinfo.com
Date: 2015-04-17T22:02:20Z

    Common interface for PairRDD functions
[GitHub] spark pull request: [Spark-4848] Allow different Worker configurat...
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/5140#issuecomment-92497626 @andrewor14 - is there anything else that needs to get done here?
[GitHub] spark pull request: [Spark-4848] Stand-alone cluster: Allow differ...
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/5140#issuecomment-85117276 I'm not sure how mesos and yarn clusters are started/stopped (nor do I have such clusters on which to test), so I'm not sure how this will affect them. I think the way I did this should be safe - it's mostly just moving code around - but I could use a knowledgeable set of eyes to be sure.
[GitHub] spark pull request: [SPARK-4848] Stand-alone cluster: Allow differ...
Github user nkronenfeld closed the pull request at: https://github.com/apache/spark/pull/3699
[GitHub] spark pull request: [SPARK-4848] Stand-alone cluster: Allow differ...
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/3699#issuecomment-85112967 I'm redoing this in the latest code, remaking the PR from scratch, to alleviate merge issues. I'll post the new PR here when it's made.
[GitHub] spark pull request: [Spark-4848] Stand-alone cluster: Allow differ...
GitHub user nkronenfeld opened a pull request: https://github.com/apache/spark/pull/5140 [Spark-4848] Stand-alone cluster: Allow differences between workers with multiple instances

This re-fixes #3699 with the latest code. This fixes SPARK-4848. I've changed the stand-alone cluster scripts to allow different workers to have different numbers of instances, with both the port and the web-UI port following along appropriately. I did this by moving the loop over instances from start-slaves and stop-slaves (on the master) to start-slave and stop-slave (on the worker). While I was at it, I changed SPARK_WORKER_PORT to work the same way as SPARK_WORKER_WEBUI_PORT, since the new method works fine for both.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nkronenfeld/spark-1 feature/spark-4848

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5140.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #5140

commit d739640308ca0884bf5cd678dbedf3cc85c3cec9
Author: Nathan Kronenfeld nkronenf...@oculusinfo.com
Date: 2015-03-23T17:28:44Z

    Move looping through instances from the master to the workers, so that each worker respects its own number of instances and web-ui port
[GitHub] spark pull request: [SPARK-4848] Stand-alone cluster: Allow differ...
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/3699#issuecomment-66998011 Jenkins, test this please
[GitHub] spark pull request: [SPARK-4848] Stand-alone cluster: Allow differ...
GitHub user nkronenfeld opened a pull request: https://github.com/apache/spark/pull/3699 [SPARK-4848] Stand-alone cluster: Allow differences between workers with multiple instances

I've changed the stand-alone cluster run scripts to allow different workers to have different numbers of instances and base web-UI ports. I did this by moving the loop over instances from start-slaves to start-slave. In order to stop things properly, I had to make similar changes in stop-slaves (and introduce stop-slave). While I was at it, I changed SPARK_WORKER_PORT to work the same way as SPARK_WORKER_UI_PORT, since the new method works fine for both.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nkronenfeld/spark-1 startup-scripts

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3699.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3699

commit 479c31c9d3e580879d76146e2a687b5235c87b33
Author: Nathan Kronenfeld nkronenf...@oculusinfo.com
Date: 2014-12-15T06:58:08Z

    Move looping through instances from the master to the workers, so that each worker respects its own number of instances and web-ui port.
[GitHub] spark pull request: [SPARK-4848] Stand-alone cluster: Allow differ...
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/3699#issuecomment-66957137 I'm not sure how mesos and yarn clusters are started/stopped (nor do I have such clusters on which to test), so I'm not sure how this will affect them. I think the way I did this should be safe - it's mostly just moving code around - but I could use a knowledgeable set of eyes to be sure.
[GitHub] spark pull request: [SPARK-4772] Clear local copies of accumulator...
Github user nkronenfeld commented on a diff in the pull request: https://github.com/apache/spark/pull/3570#discussion_r21572834

--- Diff: core/src/main/scala/org/apache/spark/Accumulators.scala ---

    @@ -281,7 +282,9 @@ object AccumulatorParam {
     private object Accumulators {
       // TODO: Use soft references? => need to make readObject work properly then
       val originals = Map[Long, Accumulable[_, _]]()
    -  val localAccums = Map[Thread, Map[Long, Accumulable[_, _]]]()
    +  val localAccums = new ThreadLocal[Map[Long, Accumulable[_, _]]]() {
    +    override protected def initialValue() = Map[Long, Accumulable[_, _]]()
    +  }
       var lastId: Long = 0

--- End diff --

you mean the lastId? That should only ever get used on the client - it's only called from the constructor of an individual accumulator, and if someone is creating one of those on a worker, they're already in trouble - so it should be ok as is.
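The pattern in that diff can be shown self-contained. The toy `LocalAccums` object below is not Spark's private Accumulators object; it just demonstrates how overriding `initialValue` gives each thread its own map lazily, replacing an explicit map keyed by Thread:

```scala
import scala.collection.mutable

// Toy version of the ThreadLocal-with-initialValue pattern: each thread
// lazily gets its own mutable map, so per-thread state never needs to be
// keyed explicitly by Thread (and can be dropped per-thread via remove()).
object LocalAccums {
  private val perThread = new ThreadLocal[mutable.Map[Long, Int]]() {
    override protected def initialValue() = mutable.Map[Long, Int]()
  }
  def set(id: Long, v: Int): Unit = perThread.get()(id) = v
  def snapshot(): Map[Long, Int] = perThread.get().toMap
  def clear(): Unit = perThread.remove() // drops only this thread's copy
}
```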
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/3570#issuecomment-66380241 Comment fixed. I'm trying to test the MiMa-related changes to see if they work, and having problems running MiMa on my machine. I'll probably just push them in the suggested form to see if they pass on Jenkins.
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/3570#issuecomment-66398560 I thought I'd done so; it looks like it lost my changes. I'll fix that ASAP.
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/3570#issuecomment-66399588 Sorry, I must have accidentally hit cancel instead of comment the first time. Should be set now.
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/3570#issuecomment-66238332 Any word on this?
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/3570#issuecomment-65883687 Should I back out the correction to the MiMa failure?
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/3570#issuecomment-65883762 Great... I think outside the MiMa issue, it should be all set, unless I can figure out a way to unit test it. So far, my best methods of testing it involve instrumenting the code in ways I shouldn't check in.
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/3570#issuecomment-65883848 Oh, a note for when you're reviewing: I didn't move the clear call, I just added a second one. I saw no particular harm in leaving the old one there too, just in case, but I can't see it doing much anymore - it should always be a no-op now. I'd be happier removing it if, again, I could figure out a good unit test to make sure everything still functions properly once I do. But I'm totally open to removing it in the interest of code cleanliness if you want.
[GitHub] spark pull request: Clear local copies of accumulators as soon as ...
GitHub user nkronenfeld opened a pull request: https://github.com/apache/spark/pull/3570 Clear local copies of accumulators as soon as we're done with them

I suspect this PR is not quite ready for merging, but this seemed the best way of getting a few comments on it. A little explanation: I'm working with some rather large accumulators. I've been asked why I don't just use a reducer, and the answer is that when this works, it works about 10x as fast. And so far, when it doesn't work, it seems improvable. This is my first attempt at such improvement.

As far as I can tell, on each worker, each time a task is run, the following happens:

1. The global accumulator object is cleared on that thread
2. The current individual accumulators are registered on that thread (as copies)
3. When the task completes, the accumulator values are collected and returned

Note that when this is done, nothing is cleared. This means that, if one is using large accumulators, those accumulators just use up memory uselessly until the next task is run on that thread. And if a thread somehow dies, that memory is used up and can't be retrieved. And, as far as I can tell, that local thread value will never be used again.

All I've done so far is to clear local values for that thread as it returns them. I would like to do 2 more things:

1. Reduce this to a two-step process - register, and return - both clearing values
2. Put in something to make sure values are cleared if a task dies (or a thread too, if that is possible)

This seems so simple so far, though, that I was hoping I could get some confirmation that I understood things correctly before going on. I don't have a JIRA issue on this so far - the issues I see are a bit hard to pin down. When I run my application without this change, it works, but is very fragile - some tasks take 6 seconds, some 4 minutes. With this change, everything seems to run smoothly, and I stop getting random, large task times.
This is somewhat related to JIRA issue SPARK-664, in that both deal with accumulator performance, but it doesn't really address that issue directly. You can merge this pull request into a Git repository by running: $ git pull https://github.com/nkronenfeld/spark-1 Accumulator-Improvements Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3570.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3570 commit 39a82f2b3ccf68a9a4d1abb0561b5680278a2610 Author: Nathan Kronenfeld nkronenf...@oculusinfo.com Date: 2014-12-03T04:11:11Z Clear local copies of accumulators as soon as we're done with them
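The three-step worker lifecycle described in the PR text, with the proposed fix applied (clearing as values are returned), can be modelled in plain Scala. `TaskRunner` and `runTask` are illustrative names, not Spark's internals.

```scala
import scala.collection.mutable

// Toy model of the worker-side task lifecycle, with values cleared as soon
// as they are collected. Hypothetical names; not Spark's actual code.
object TaskRunner {
  private val localValues = new ThreadLocal[mutable.Map[Long, Seq[Int]]]() {
    override protected def initialValue() = mutable.Map[Long, Seq[Int]]()
  }

  def runTask(accumIds: Seq[Long], work: Long => Seq[Int]): Map[Long, Seq[Int]] = {
    val local = localValues.get()
    local.clear()                                 // 1. clear stale copies on this thread
    accumIds.foreach(id => local(id) = Nil)       // 2. register fresh local copies
    accumIds.foreach(id => local(id) = work(id))  //    ...the task updates its copies
    val results = local.toMap                     // 3. collect values to return...
    local.clear()                                 //    ...and clear them right away
    results
  }
}
```

Without the final `clear()`, each thread would keep its last task's accumulator copies alive until step 1 of the next task - exactly the wasted memory the PR describes for large accumulators.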
[GitHub] spark pull request: Rectify gereneric parameter names between Spar...
GitHub user nkronenfeld opened a pull request: https://github.com/apache/spark/pull/2637 Rectify gereneric parameter names between SparkContext and AccumulablePa... AccumulableParam gave its generic parameters as 'R, T', whereas SparkContext labeled them 'T, R'. Trivial, but really confusing. I resolved this in favor of AccumulableParam, because it seemed to have some logic for its names. I also extended this minimal, but at least present, justification into the SparkContext comments. You can merge this pull request into a Git repository by running: $ git pull https://github.com/nkronenfeld/spark-1 accumulators Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2637.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2637 commit 98d6b74a20cb39d7b3bc8a6336b5c15ca4e7197a Author: Nathan Kronenfeld nkronenf...@oculusinfo.com Date: 2014-10-02T22:06:18Z Rectify gereneric parameter names between SparkContext and AccumulableParam
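The naming convention the PR standardizes on can be illustrated with a stripped-down version of the trait. `AccParam` and `VecParam` are hypothetical stand-ins, not Spark's actual (since-removed) `AccumulableParam` API.

```scala
// 'R' is the accumulated result type, 'T' the type of each added element -
// the 'R, T' order the PR settles on. Illustrative names, not Spark's API.
trait AccParam[R, T] {
  def zero(initial: R): R
  def addAccumulator(r: R, t: T): R
}

// Accumulate Ints into a Vector[Int]: R = Vector[Int], T = Int.
object VecParam extends AccParam[Vector[Int], Int] {
  def zero(initial: Vector[Int]): Vector[Int] = Vector.empty
  def addAccumulator(r: Vector[Int], t: Int): Vector[Int] = r :+ t
}
```

A context method declared with the same order, e.g. `def accumulable[R, T](initial: R)(implicit p: AccParam[R, T])`, then reads consistently with the trait - the mismatch the PR removes.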
[GitHub] spark pull request: Add caching information to rdd.toDebugString
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/1535#issuecomment-50836302 OK, @pwendell, I think it's set now. Let me know if there are merge problems; I can resubmit on a clean branch if necessary.
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/1535#issuecomment-50621396 Thanks, @pwendell. I can revert it back if you want - is that preferable to the way it is now, with the option to include the memory info or not? I'll start with taking out the DeveloperAPI and adjusting the docs; I'll hold off on taking out the optional memory parameter until I hear from you again.
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/1535#issuecomment-50487526 If I'm reading that correctly, that test failure is from an MLlib change that has nothing to do with what I've done? Perhaps I'll just try it again; maybe it's a bad sync with master: Jenkins, please test this
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/1535#issuecomment-50388139 I just parameterized the memory display so one can show it or not as desired (with not displaying it as the default) - is that sufficient? I forgot to put the note about the JIRA into the code; I'll definitely add that too. Or I can back out the optional behavior and just leave in the code comment about the JIRA. Let me know which you want, please. Thanks, -Nathan
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/1535#issuecomment-49870919 I'm not sure what to do about this test failure; all I've changed is toDebugString, and this is a Spark Streaming test that never calls it, so I'm pretty sure it's nothing to do with me.
Github user nkronenfeld commented on a diff in the pull request: https://github.com/apache/spark/pull/1535#discussion_r15308845

--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1269,6 +1269,19 @@ abstract class RDD[T: ClassTag](
 /** A description of this RDD and its recursive dependencies for debugging. */
 def toDebugString: String = {
+  // Get a debug description of an rdd without its children
+  def debugSelf(rdd: RDD[_]): Seq[String] = {
+    import Utils.bytesToString
+
+    val persistence = storageLevel.description
+    val storageInfo = rdd.context.getRDDStorageInfo.filter(_.id == rdd.id).map(info =>
--- End diff --

I'm not sure what you mean - do you mean an extremely costly operation? Assuming that to be the case, two comments:
* I thought about attaching flags to the function so one could specify the type of debug information desired; I think that makes the function too complex, but I'm hardly firm in that idea.
* This whole function is specifically to help a developer with debugging. I don't _think_ having it be costly is all that bad.
[GitHub] spark pull request: Add caching information to RDD.toDebugString
GitHub user nkronenfeld opened a pull request: https://github.com/apache/spark/pull/1532 Add caching information to RDD.toDebugString I find it useful to see where in an RDD's DAG data is cached, so I figured others might too. I've added both the caching level, and the actual memory state of the RDD. Some of this is redundant with the web UI (notably the actual memory state), but (a) that is temporary, and (b) putting it in the DAG tree shows some context that can help a lot. You can merge this pull request into a Git repository by running: $ git pull https://github.com/nkronenfeld/spark-1 feature/debug-caching Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1532.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1532 commit 06c76ab961afc42e8305d6a3f186361c1e20e04d Author: Nathan Kronenfeld nkronenf...@oculusinfo.com Date: 2014-07-22T14:39:41Z Add caching information to RDD.toDebugString
Github user nkronenfeld closed the pull request at: https://github.com/apache/spark/pull/1532
GitHub user nkronenfeld opened a pull request: https://github.com/apache/spark/pull/1535 Add caching information to rdd.toDebugString I find it useful to see where in an RDD's DAG data is cached, so I figured others might too. I've added both the caching level, and the actual memory state of the RDD. Some of this is redundant with the web UI (notably the actual memory state), but (a) that is temporary, and (b) putting it in the DAG tree shows some context that can help a lot. You can merge this pull request into a Git repository by running: $ git pull https://github.com/nkronenfeld/spark-1 feature/debug-caching2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1535.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1535 commit 8fbecb6eb47505e7e56949c00107b917c6c5e945 Author: Nathan Kronenfeld nkronenf...@oculusinfo.com Date: 2014-07-22T18:44:58Z Add caching information to rdd.toDebugString
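The kind of annotated output the PR adds - a per-node caching note inline in the DAG tree - can be sketched with a toy recursive printer. `DebugNode`, `debugString`, and the `storage` field are hypothetical stand-ins for `RDD`, `toDebugString`, and `storageLevel.description`.

```scala
// Toy DAG printer: each node carries a caching/storage annotation that is
// printed inline in the dependency tree. Hypothetical names, not Spark's API.
case class DebugNode(name: String, storage: String, deps: Seq[DebugNode])

def debugString(node: DebugNode, indent: String = ""): String = {
  val self = s"$indent${node.name} [${node.storage}]"
  (self +: node.deps.map(d => debugString(d, indent + "  "))).mkString("\n")
}
```

Seeing the storage annotation next to each node in the tree is the "context" argument from the PR description: the web UI shows the same memory state, but not where in the lineage it sits.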
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/1535#issuecomment-49783708 Done, and I also left a comment on Greg Owen's PR from yesterday asking him for formatting comments.
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/1535#issuecomment-49797427 Sorry, I forgot to move one small formatting fix over from the old branch; I'll check that in as soon as I test it.
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/1535#issuecomment-49825329 Thanks, Mark, I had no idea that existed.
[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD
Github user nkronenfeld commented on the pull request: https://github.com/apache/spark/pull/136#issuecomment-38506453 I think my initial sliding-method PR addresses @pwendell's concerns here, but runs afoul of some other concerns raised by @rxin about code complexity. Whichever approach ends up deemed superior overall, I would suggest that the existence of two PRs for the same operation suggests it isn't all that niche. As a data point, our use had nothing to do with machine learning - it involved deriving travel paths from individual location reports. One suggestion for this PR: make sure the tests include a case with empty partitions between non-empty partitions.
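The empty-partition edge case suggested above can be exercised without Spark by sliding over concatenated partition contents. `slidingOverPartitions` is a hypothetical reference helper, not the implementation under review; it states what a correct partitioned `sliding` must be equivalent to.

```scala
// Reference behavior: sliding over partitioned data must act as if the
// partitions were concatenated, so windows straddling an empty middle
// partition appear exactly once. Hypothetical helper, for illustration.
def slidingOverPartitions[T](partitions: Seq[Seq[T]], window: Int): Seq[Seq[T]] =
  partitions.flatten.sliding(window).filter(_.size == window).toSeq

// An empty partition between non-empty ones - the suggested test case.
val parts = Seq(Seq(1, 2), Seq.empty[Int], Seq(3, 4))
```

A partition-aware implementation that only fetches the head of the *next* partition to complete boundary windows would break on `parts` above: the window spanning 2 and 3 crosses the empty partition, which is why this case is worth a dedicated test.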