[GitHub] spark pull request #19067: [SPARK-21849][Core]Make the serializer function m...
Github user djvulee closed the pull request at: https://github.com/apache/spark/pull/19067 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19067: [SPARK-21849][Core]Make the serializer function more rob...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/19067 Yes, I agree. It would be better to include this in a normal pull request.
[GitHub] spark pull request #19067: [SPARK-21849][Core]Make the serializer function m...
GitHub user djvulee opened a pull request: https://github.com/apache/spark/pull/19067 [SPARK-21849][Core]Make the serializer function more robust ## What changes were proposed in this pull request? Make sure the `close` function is called in the `finally` block. ## How was this patch tested? No test, compile only. You can merge this pull request into a Git repository by running: $ git pull https://github.com/djvulee/spark serializer Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19067.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19067 commit b523ecbef727df73b3b018eb851fb66981e98770 Author: DjvuLee Date: 2017-08-28T07:00:56Z [SPARK-21849][Core]Make the serializer function more robust
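The fix described in the PR is the standard try/finally resource pattern: the stream must be closed even when serialization throws. Spark's actual code is Scala; as a rough illustration of the same idea, here is a minimal Python sketch (the `serialize` helper and its use of `pickle` are illustrative, not Spark's API):

```python
import io
import pickle

def serialize(obj):
    """Serialize obj to bytes, guaranteeing the stream is closed."""
    buf = io.BytesIO()
    try:
        pickle.dump(obj, buf)
        # The return value is computed before the finally clause runs,
        # so the bytes are captured before the buffer is closed.
        return buf.getvalue()
    finally:
        # Runs whether or not pickle.dump raised, so the stream
        # can never be leaked on a serialization failure.
        buf.close()
```

Because `finally` runs on both the success and the failure path, the resource is released even when an exception propagates to the caller.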
[GitHub] spark issue #18651: [SPARK-21383][Core] Fix the YarnAllocator allocates more...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/18651 I updated the code; please take a look, @vanzin @tgravescs
[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...
Github user djvulee commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128943316 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala --- @@ -525,9 +534,11 @@ private[yarn] class YarnAllocator( } catch { case NonFatal(e) => logError(s"Failed to launch executor $executorId on container $containerId", e) - // Assigned container should be released immediately to avoid unnecessary resource - // occupation. + // Assigned container should be released immediately + // to avoid unnecessary resource occupation. amClient.releaseAssignedContainer(containerId) + } finally { +numExecutorsStarting.decrementAndGet() --- End diff -- I agree that putting `numExecutorsStarting.decrementAndGet()` together with `numExecutorsRunning.incrementAndGet()` in `updateInternalState` would be better if we could. The reason I put `numExecutorsStarting.decrementAndGet()` in the `finally` block is that if some exception is not `NonFatal` and gets caught by later code, we may not be able to allocate the resources we specified, which is the same thing @vanzin worried about. We may double the count in the current code, but that only slows down the allocation rate for a short time.
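The point argued in this thread is that a `finally` block runs regardless of whether the escaping exception is fatal, so the in-flight counter can never stay inflated. A small Python sketch of the pattern (names such as `launch_executor` are made up for illustration; Spark uses an `AtomicInteger` in Scala):

```python
from threading import Lock

class Counter:
    """Thread-safe counter standing in for Scala's AtomicInteger."""
    def __init__(self):
        self._value = 0
        self._lock = Lock()

    def increment(self):
        with self._lock:
            self._value += 1

    def decrement(self):
        with self._lock:
            self._value -= 1

    @property
    def value(self):
        return self._value

num_executors_starting = Counter()

def launch_executor(run):
    num_executors_starting.increment()
    try:
        run()
    finally:
        # Runs for ordinary and fatal exceptions alike, so the
        # "starting" count cannot leak upward and make the allocator
        # believe executors are still being launched.
        num_executors_starting.decrement()
```

Had the decrement lived only inside a `catch NonFatal` arm, a fatal error would skip it and permanently overcount the starting executors.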
[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...
Github user djvulee commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128435807 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala --- @@ -525,8 +535,9 @@ private[yarn] class YarnAllocator( } catch { case NonFatal(e) => logError(s"Failed to launch executor $executorId on container $containerId", e) - // Assigned container should be released immediately to avoid unnecessary resource - // occupation. + // Assigned container should be released immediately + // to avoid unnecessary resource occupation. + numExecutorsStarting.decrementAndGet() --- End diff -- Yes, it is more robust. I have updated the code.
[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...
Github user djvulee commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128432387 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala --- @@ -242,7 +244,7 @@ private[yarn] class YarnAllocator( if (executorIdToContainer.contains(executorId)) { val container = executorIdToContainer.get(executorId).get internalReleaseContainer(container) - numExecutorsRunning -= 1 --- End diff -- Yes, I was just trying to keep it consistent with `numExecutorsStarting`.
[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...
Github user djvulee commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128290766 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala --- @@ -294,7 +296,8 @@ private[yarn] class YarnAllocator( def updateResourceRequests(): Unit = { val pendingAllocate = getPendingAllocate val numPendingAllocate = pendingAllocate.size -val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning +val missing = targetNumExecutors - numPendingAllocate - + numExecutorsStarting.get - numExecutorsRunning.get --- End diff -- Thanks for your advice! I just added the debug info.
[GitHub] spark issue #18651: [SPARK-21383][Core] Fix the YarnAllocator allocates more...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/18651 I just updated the code and tested it by experiment. Can you take a look, @vanzin?
[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...
Github user djvulee commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128144194 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala --- @@ -505,32 +508,37 @@ private[yarn] class YarnAllocator( if (numExecutorsRunning < targetNumExecutors) { if (launchContainers) { - launcherPool.execute(new Runnable { -override def run(): Unit = { - try { -new ExecutorRunnable( - Some(container), - conf, - sparkConf, - driverUrl, - executorId, - executorHostname, - executorMemory, - executorCores, - appAttemptId.getApplicationId.toString, - securityMgr, - localResources -).run() -updateInternalState() - } catch { -case NonFatal(e) => - logError(s"Failed to launch executor $executorId on container $containerId", e) - // Assigned container should be released immediately to avoid unnecessary resource - // occupation. - amClient.releaseAssignedContainer(containerId) + try { +numExecutorToBeLaunched += 1 +launcherPool.execute(new Runnable { + override def run(): Unit = { +try { + new ExecutorRunnable( +Some(container), +conf, +sparkConf, +driverUrl, +executorId, +executorHostname, +executorMemory, +executorCores, +appAttemptId.getApplicationId.toString, +securityMgr, +localResources + ).run() + updateInternalState() +} catch { + case NonFatal(e) => +logError(s"Failed to launch executor $executorId on container $containerId", e) +// Assigned container should be released immediately +// to avoid unnecessary resource occupation. +amClient.releaseAssignedContainer(containerId) +} } -} - }) +}) + } finally { +numExecutorToBeLaunched -= 1 --- End diff -- Yes, you're right. When I tested the code by experiment, I decremented `numExecutorToBeLaunched` in the `updateInternalState` function, but I later found this may impact the test. I will fix this soon.
[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...
Github user djvulee commented on a diff in the pull request: https://github.com/apache/spark/pull/18651#discussion_r128143898 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala --- @@ -82,6 +82,8 @@ private[yarn] class YarnAllocator( @volatile private var numExecutorsRunning = 0 + @volatile private var numExecutorToBeLaunched = 0 --- End diff -- OK! I will change the name.
[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...
GitHub user djvulee opened a pull request: https://github.com/apache/spark/pull/18651 [SPARK-21383][Core] Fix the YarnAllocator allocates more Resource ## What changes were proposed in this pull request? When NodeManagers are slow to launch executors, the `missing` value will exceed the real value, which can lead YARN to allocate more resources than needed. We add `numExecutorToBeLaunched` when calculating `missing` to avoid this. ## How was this patch tested? Tested by experiment. You can merge this pull request into a Git repository by running: $ git pull https://github.com/djvulee/spark YarnAllocate Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18651.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18651 commit 818c9126959e8576861478e18389e6ed8fdbeac4 Author: DjvuLee Date: 2017-07-17T07:54:09Z [Core] Fix the YarnAllocator allocate more Resource When NodeManagers launch the Executors slowly, the `missing` value will exceed the real value, which can lead YARN to allocate more resources.
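The core of the fix is the bookkeeping in `updateResourceRequests`: executors that are still launching are counted neither as pending nor as running, so without a separate counter the allocator requests containers for them a second time. A hedged Python sketch of the corrected arithmetic (the function name is illustrative only; Spark's version lives in Scala in `YarnAllocator`):

```python
def missing_executors(target, pending, starting, running):
    """Number of additional containers to request from YARN.

    Executors still being launched must be subtracted as well;
    otherwise the allocator re-requests containers for them and
    YARN ends up allocating more resources than the target.
    """
    return max(0, target - pending - starting - running)
```

Before the fix the `starting` term was missing, so a slow launch window inflated the request by exactly the number of in-flight executors.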
[GitHub] spark pull request #18280: [SPARK-21064][Core][Test] Fix the default value b...
GitHub user djvulee opened a pull request: https://github.com/apache/spark/pull/18280 [SPARK-21064][Core][Test] Fix the default value bug in NettyBlockTransferServiceSuite ## What changes were proposed in this pull request? The default value for `spark.port.maxRetries` is 100, but we use 10 in the suite file. So we change it to 100 to avoid test failure. ## How was this patch tested? No test You can merge this pull request into a Git repository by running: $ git pull https://github.com/djvulee/spark NettyTestBug Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18280.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18280 commit 273f76d183eeda9aef7c9c10dbcd9307773c3eec Author: DjvuLee Date: 2017-06-12T12:13:03Z [SPARK-21064][Core][Test] Fix the default value bug in NettyBlockTransferServiceSuite The default value for `spark.port.maxRetries` is 100, but we use 10 in the suite file.
[GitHub] spark pull request #18279: [SPARK-21064][Core][Test] Fix the default value b...
Github user djvulee closed the pull request at: https://github.com/apache/spark/pull/18279
[GitHub] spark issue #18279: [SPARK-21064][Core][Test] Fix the default value bug in N...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/18279 Ok, thanks!
[GitHub] spark issue #18279: [SPARK-21064][Core][Test] Fix the default value bug in N...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/18279 We should port this to master too.
[GitHub] spark pull request #18279: [SPARK-21064][Core][Test] Fix the default value b...
GitHub user djvulee opened a pull request: https://github.com/apache/spark/pull/18279 [SPARK-21064][Core][Test] Fix the default value bug in NettyBlockTran… ## What changes were proposed in this pull request? Fix the default value bug in NettyBlockTransferServiceSuite. The default value for `spark.port.maxRetries` is 100, but we use 10 in the suite file; we change 10 to 100. ## How was this patch tested? No test You can merge this pull request into a Git repository by running: $ git pull https://github.com/djvulee/spark branch-2.2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18279.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18279 commit 5de1790783f07737432f75ef7ed7ea8804fc6b20 Author: DjvuLee Date: 2017-06-12T11:50:02Z [SPARK-21064][Core][Test] Fix the default value bug in NettyBlockTransferServiceSuite The default value for `spark.port.maxRetries` is 100, but we use 10 in the suite file.
[GitHub] spark issue #15505: [SPARK-18890][CORE] Move task serialization from the Tas...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/15505 >I agree with Kay that putting in a smaller change first is better, assuming it still has the performance gains. That doesn't preclude any further optimizations that are bigger changes. >I'm a little surprised that the serializing tasks has much of an impact, given how little data is getting serialized. But if it really is, I feel like there is a much bigger optimization we're completely missing. Why are we repeating the work of serialization for each task in a taskset? The serialized data is almost exactly the same for every task. they only differ in the partition id (an int) and the preferred locations (which aren't even used by the executor at all). >Task serialization already leverages the idea of having info across all the tasks in the Broadcast for the task binary. We just need to use that same idea for all the rest of the task data that is sent to the executor. Then the only difference between the serialized task data sent to executors is the int for the partitionId. You'd serialize into a bytebuffer once, and then your per-task "serialization" becomes copying the buffer and modifying that int directly. @squito I like this idea very much. I just encountered deserialization times that are too long (more than 10s for some tasks). Is there any PR trying to solve this?
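The optimization @squito describes (serialize the shared task data once, then stamp the per-task partition id into a copy of the buffer) can be sketched as follows in Python. The fixed-width header layout and the function names are assumptions for illustration, not Spark's wire format:

```python
import pickle
import struct

def build_task_template(common_data):
    # Serialize everything shared by all tasks in the taskset once,
    # and reserve a fixed-width slot (4 bytes, big-endian) at the
    # front for the per-task partition id.
    body = pickle.dumps(common_data)
    return struct.pack(">I", 0) + body

def specialize(template, partition_id):
    # Per-task "serialization" is just a copy of the template plus
    # patching the partition-id int; no re-serialization happens.
    return struct.pack(">I", partition_id) + template[4:]

def read_task(blob):
    # Executor side: read the patched id, then deserialize the
    # shared body.
    (pid,) = struct.unpack(">I", blob[:4])
    return pid, pickle.loads(blob[4:])
```

The saving is that `pickle.dumps` runs once per taskset instead of once per task; each task costs only a buffer copy and a 4-byte write.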
[GitHub] spark pull request #16671: [SPARK-19327][SparkSQL] a better balance partitio...
Github user djvulee closed the pull request at: https://github.com/apache/spark/pull/16671
[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16671 @HyukjinKwon One assumption behind this design is that the specified column has an index in most real scenarios, so the table scan cost is not very high. What I observed is that most large tables use sharding, so the count cost is acceptable; this is why we spend less time on a 5M-row table than on a 1M-row table. If we use `repartition`, there is a bottleneck when loading data from the DB, plus the high cost of `repartition` itself. Anyway, this solution is indeed expensive and not a good one; maybe the best way is to use the Spark connectors provided by the DBMS vendors, as @gatorsmile suggested.
[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16671 Yes. I will leave this PR open for a few days to see if others are interested in it, and then close it.
[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16671 Yes, I agree with you: a sampling-based approach is the right choice, but it is not possible to achieve this through the `jdbc` API.
[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16671 Using the *predicates* parameter to split the table seems reasonable, but in my personal opinion it just pushes work that should be done by Spark onto users. Users need to know how to split the table uniformly in the first place, so they may need an extra `count(*)` to explore the distribution of the table.
[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16671 Yes, this solution is not suitable for large tables, but I cannot find a better one; this is the best optimisation I can find. So just add it as an option, let users know what they are doing, and require an explicit enable. From my experience, the original equal-step method can lead to problems with real data. This conclusion can be drawn from the spark-user mailing list and our real scenario. For example, users will use `id` to partition the table because `id` is unique and indexed, but after many inserts and deletes the `id` range becomes very large, and the data becomes skewed when partitioned by `id`. Very large tables are not so common, and if a large table uses sharding, this method may be acceptable. My personal opinion is: >Giving users another choice may be valuable, as long as we do not enable it by default.
[GitHub] spark issue #16671: [SparkSQL] a better balance partition method for jdbc AP...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16671 @gatorsmile can you take a look?
[GitHub] spark issue #16671: [SparkSQL] a better balance partition method for jdbc AP...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16671 Table 2 with about 5M rows, 200 partitions by SparkSQL. (The table uses MySQL sharding, and every partition will return 10K rows at most.) Old partition result (elements in each partition): >1,49,54,53,60,59,48,61,52,57,60,69,58,57,50,52,51,66,58,45,59,52,61,56,67,51,45,49,70,49,58,59,61,53,50,53,47,50,46,53,55,53,62,55,48,58,52,62,62,37,65,59,58,55,61,59,46,53,49,49,61,72,60,46,50,51,45,47,55,63,64,63,55,47,65,57,60,60,51,45,48,77,58,57,59,39,50,62,55,57,49,63,51,38,49,66,62,58,53,54,50,54,52,69,51,49,61,60,64,49,52,50,54,58,48,51,50,49,41,68,54,45,65,62,44,52,64,58,47,51,65,47,37,42,39,44,51,65,56,54,69,51,61,63,51,52,47,55,58,66,47,54,53,53,60,66,66,68,64,66,55,58,64,55,50,57,46,56,39,60,57,63,40,51,56,58,44,46,46,44,42,52,52,44,53,46,55,57,68,57,62,48,47,52,59,58,49,44,52,47 (most of the data is in partition 0, but each partition returns 10K at most because of our sharding limit.) New partition result (elements in each partition): >2083,1,1,6932,9799,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,8150,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,7,9,70,2,1,1,1,655,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,40,76,145,38,86,176,369,696,1338,2776,5381 count cost time: 0.8ms
[GitHub] spark issue #16671: [SparkSQL] a better balance partition method for jdbc AP...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16671 Here is the real-data test result: a table with 1.2 million rows, 50 partitions by SparkSQL. Old partition result (elements in each partition): >100061,100064,100059,100066,100065,100065,100066,100066,100063,100061,100066,100065,70747,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 New partition result (elements in each partition): >19543,19544,39083,39088,19544,19545,39085,19544,19542,19543,19545,39086,39087,19544,19545,39088,19544,19544,39088,19543,19545,39088,19544,19545,39088,19544,19544,39088,19544,19545,19543,19544,39086,19543,19545,39086,39086,19544,19545,39088,19544,19545,39088,19544,19544,39088,19544,19545,20701,0 count cost time: 1.27s
[GitHub] spark pull request #16671: [SparkSQL] a better balance partition method for ...
GitHub user djvulee opened a pull request: https://github.com/apache/spark/pull/16671 [SparkSQL] a better balance partition method for jdbc API ## What changes were proposed in this pull request? The partition method in `jdbc` uses an equal step, which can lead to skew between partitions. This change introduces a balanced partition method based on element counts when splitting, which can relieve the skew problem at a small query cost. ## How was this patch tested? Unit tests and real data. You can merge this pull request into a Git repository by running: $ git pull https://github.com/djvulee/spark balancePartition Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16671.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16671 commit 88cdf294aa579f65b8272870d762548cf54349ce Author: DjvuLee Date: 2017-01-20T09:53:57Z [SparkSQL] a better balance partition method for jdbc API The partition method in jdbc, when a column is specified, uses an equal step, which can lead to skew between partitions. The new method introduces a partition method based on element counts when splitting, which keeps the elements balanced between partitions.
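The idea of the PR, splitting on equal element counts rather than equal value steps, amounts to choosing partition boundaries at quantiles of the partition column. A simplified in-memory Python sketch (the PR itself computes counts via SQL queries; the function name is illustrative):

```python
def balanced_bounds(sorted_values, num_partitions):
    # Pick split points at equal-count quantiles of the column, so
    # each partition holds roughly the same number of rows even
    # when the values themselves are skewed.
    n = len(sorted_values)
    return [sorted_values[(i * n) // num_partitions]
            for i in range(1, num_partitions)]
```

With an equal-step split of a skewed id range, most rows fall into a few partitions; quantile boundaries keep the row counts balanced at the cost of inspecting the data distribution first.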
[GitHub] spark issue #16599: [SPARK-19239][PySpark] Check the lowerBound and upperBou...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16599 I updated the PR and tested the change in the PySpark shell.
[GitHub] spark pull request #16599: [SPARK-19239][PySpark] Check the lowerBound and u...
Github user djvulee commented on a diff in the pull request: https://github.com/apache/spark/pull/16599#discussion_r96357233

--- Diff: python/pyspark/sql/readwriter.py ---
@@ -431,6 +432,8 @@ def jdbc(self, url, table, column=None, lowerBound=None, upperBound=None, numPar
        if column is not None:
            if numPartitions is None:
                numPartitions = self._spark._sc.defaultParallelism
--- End diff --

I am a little worried about whether this change will break the API. If some users specify only `column`, `lowerBound`, and `upperBound` in some Spark version, their program will fail after an upgrade, even though very few people rely on just the default parallelism. In my opinion, I prefer the change that keeps the API consistent. If you would rather add the assert on `numPartitions`, I will update the PR soon.
[GitHub] spark pull request #16599: [SPARK-19239][PySpark] Check the lowerBound and u...
Github user djvulee commented on a diff in the pull request: https://github.com/apache/spark/pull/16599#discussion_r96339764

--- Diff: python/pyspark/sql/readwriter.py ---
@@ -431,6 +432,8 @@ def jdbc(self, url, table, column=None, lowerBound=None, upperBound=None, numPar
        if column is not None:
            if numPartitions is None:
                numPartitions = self._spark._sc.defaultParallelism
+            assert lowerBound != None, "lowerBound can not be None when ``column`` is specified"
+            assert upperBound != None, "upperBound can not be None when ``column`` is specified"
--- End diff --

Yes, the Scala code could check this, but the PySpark code will fail at ```int(lowerBound)``` first, so the user is confused.
[GitHub] spark issue #16599: [SPARK-19239][PySpark] Check the lowerBound and upperBou...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16599 @zsxwing can you take a look?
[GitHub] spark pull request #16599: [SPARK-19239][PySpark] Check the lowerBound and u...
GitHub user djvulee opened a pull request: https://github.com/apache/spark/pull/16599 [SPARK-19239][PySpark] Check the lowerBound and upperBound whether equals None in jdbc API

## What changes were proposed in this pull request?

The `jdbc` API does not check `lowerBound` and `upperBound` when ``column`` is specified, and just throws the following exception:
>```int() argument must be a string or a number, not 'NoneType'```
If we check the parameters, we can give a friendlier suggestion.

## How was this patch tested?

Tested in the PySpark shell, without the lowerBound and upperBound parameters.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/djvulee/spark pysparkFix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16599.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16599 commit 94c44ba368acb3c7fa648ad66cfd3cac352af911 Author: DjvuLee Date: 2017-01-16T08:43:34Z [SPARK-19239][PySparK] Check the lowerBound and upperBound whether equal None in jdbc API The ``jdbc`` API does not check the lowerBound and upperBound when ``column`` is specified, and just throws the following exception: ```int() argument must be a string or a number, not 'NoneType'``` If we check the parameters, we can give a friendlier suggestion.
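The friendlier failure mode can be sketched as a standalone function (a hypothetical illustration mirroring the idea, not the readwriter.py code): validate the bounds before `int()` ever sees a `None`.

```python
def check_jdbc_bounds(column, lower_bound, upper_bound):
    # Without this check, int(None) raises
    #   TypeError: int() argument must be a string or a number, not 'NoneType'
    # which does not tell the user which jdbc() parameter was missing.
    if column is not None:
        assert lower_bound is not None, \
            "lowerBound can not be None when `column` is specified"
        assert upper_bound is not None, \
            "upperBound can not be None when `column` is specified"
        return int(lower_bound), int(upper_bound)
    return None
```

Note that `is not None` is the idiomatic spelling; `!= None` as in the diff works but triggers lint warnings.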
[GitHub] spark pull request #16210: [Core][SPARK-18778]Fix the scala classpath under ...
Github user djvulee closed the pull request at: https://github.com/apache/spark/pull/16210
[GitHub] spark issue #16210: [Core][SPARK-18778]Fix the scala classpath under some en...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16210 Yes, this PR was not well thought through; I will close it and update the JIRA.
[GitHub] spark issue #16210: [Core][SPARK-18778]Fix the scala classpath under some en...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16210 @srowen Sorry for the late reply, and thanks for your reproduction! As I mentioned in my last reply, this is not an environment problem, but a misunderstanding of SPARK_SUBMIT_OPTS on our side, or a deployment problem. This works for everyone else because few people use the ```SPARK_SUBMIT_OPTS``` option or put ```SPARK_SUBMIT_OPTS``` in the spark-env.sh file. It may be better to separate ```-Dscala.usejavacp=true``` from ```SPARK_SUBMIT_OPTS``` in the spark-shell script to avoid the misunderstanding.
[GitHub] spark issue #16210: [Core][SPARK-18778]Fix the scala classpath under some en...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16210 I found the reason: because we pass some SPARK_SUBMIT_OPTS defined by ourselves, it seems that Spark only parses our own opts and ignores ```-Dscala.usejavacp=true```. Since we want users to be able to use `SPARK_SUBMIT_OPTS`, the best way is to separate ```-Dscala.usejavacp=true``` from SPARK_SUBMIT_OPTS; moving it to SparkSubmitCommandBuilder may be a good idea, as suggested by @vanzin.
[GitHub] spark issue #16210: [Core][SPARK-18778]Fix the scala classpath under some en...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16210 @jodersky Yes. I tried different ways; here are the results:

```
SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dscala.usejavacp=true -usejavacp"
```

and

```
SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dscala.usejavacp=true -Dusejavacp"
```

will output

```
Exception in thread "main" java.lang.AssertionError: assertion failed: null
    at scala.Predef$.assert(Predef.scala:179)
    at org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.scala:247)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:990)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
    at org.apache.spark.repl.Main$.main(Main.scala:31)
    at org.apache.spark.repl.Main.main(Main.scala)
```

```
SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -usejavacp"
```

will output:

```
Unrecognized option: -usejavacp
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
```
[GitHub] spark issue #16210: [Core][SPARK-18778]Fix the scala classpath under some en...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/16210 @rxin our JDK is jdk1.8.0_91, we have not installed Scala, and the OS is Debian 4.6.4.
[GitHub] spark pull request #16210: [Core][SPARK-18778]Fix the scala classpath under ...
GitHub user djvulee opened a pull request: https://github.com/apache/spark/pull/16210 [Core][SPARK-18778]Fix the scala classpath under some environment

## What changes were proposed in this pull request?

Under some environments, the `-Dscala.usejavacp=true` option does not seem to work; passing `-usejavacp` directly to the REPL fixes this.

## How was this patch tested?

We tested in our cluster environment.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/djvulee/spark sparkShell Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16210.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16210 commit ab81a7af165c7287c0356758097dfa5ded6adea3 Author: DjvuLee Date: 2016-12-08T07:15:59Z [Core]Fix the scala classpath under some environment Under some environments, the -Dscala.usejavacp=true option does not seem to work; passing -usejavacp directly to the REPL fixes this.
[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/15249 I would say this is a very important PR. In our experience, sometimes we just need to skip some nodes because of bad disks, and the existing blacklist mechanism has little effect.
[GitHub] spark pull request #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metri...
Github user djvulee closed the pull request at: https://github.com/apache/spark/pull/15052
[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/15052 @srowen Yes, the file always seems to be empty before the write, so the original way is OK. Sorry that this PR was not thought through; I was misled by the other method in shuffle.py, which uses the position before the write. I will close this PR.
[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/15052 @srowen No. It does not matter whether the file is empty or not: if the file is empty, `getsize()` just returns 0, and this should be OK.
[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/15052 @srowen I updated the PR to use an incremental way of updating the DiskBytesSpilled metric.
[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/15052 @srowen you are right, I will correct it soon.
[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...
Github user djvulee commented on the issue: https://github.com/apache/spark/pull/15052 @srowen @davies mind taking a look? This PR is very simple.
[GitHub] spark pull request #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metri...
GitHub user djvulee opened a pull request: https://github.com/apache/spark/pull/15052 [SPARK-17500][PySpark]Make DiskBytesSpilled metric in PySpark shuffle right

## What changes were proposed in this pull request?

The original way increases the DiskBytesSpilled metric by the full file size during each spill in ExternalMerger && ExternalGroupBy, but we only need the last size.

## How was this patch tested?

No extra tests, because this just updates the metrics. Author: Li Hu

You can merge this pull request into a Git repository by running: $ git pull https://github.com/djvulee/spark PyDiskSpillMetric Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15052.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15052 commit 1b90b0dd61c22ffba6d578f73cf5aca88629b1be Author: DjvuLee Date: 2016-09-11T19:41:32Z Make DiskBytesSpilled metric in PySpark shuffle right The original way increases the DiskBytesSpilled metric by the full file size during each spill in ExternalMerger && ExternalGroupBy, but we only need the last size. No extra tests, because this just updates the metrics. Author: Li Hu
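The incremental approach discussed in this thread can be sketched as follows (class and method names are assumptions, not the actual shuffle.py internals): measure the file size before and after each spill and add only the delta, instead of re-adding the whole file size every time.

```python
import os

class SpillMetrics:
    def __init__(self):
        self.disk_bytes_spilled = 0

    def spill(self, path, data):
        # Size before this spill (0 if the file does not exist yet).
        before = os.path.getsize(path) if os.path.exists(path) else 0
        with open(path, "ab") as f:
            f.write(data)
        # Add only the bytes written by this spill, so the cumulative
        # metric stays correct across repeated spills to the same file.
        self.disk_bytes_spilled += os.path.getsize(path) - before
```

If the spill file is always freshly created (as the later discussion concludes), the delta equals the file size and the original code is already correct; the delta form is simply robust either way.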
[GitHub] spark pull request: Branch 1.1 typo error in HistoryServer
GitHub user djvulee opened a pull request: https://github.com/apache/spark/pull/3566 Branch 1.1 typo error in HistoryServer There is a typo in lines 167 & 168 of the HistoryServer.scala file: "./sbin/spark-history-server.sh" should be "./sbin/start-history-server.sh". You can merge this pull request into a Git repository by running: $ git pull https://github.com/apache/spark branch-1.1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3566.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3566 commit 9a62cf3655dcab49b5c0f94ad094603eaf288251 Author: Michael Armbrust Date: 2014-08-27T22:14:08Z [SPARK-3235][SQL] Ensure in-memory tables don't always broadcast. Author: Michael Armbrust Closes #2147 from marmbrus/inMemDefaultSize and squashes the following commits: 5390360 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into inMemDefaultSize 14204d3 [Michael Armbrust] Set the context before creating SparkLogicalPlans. 8da4414 [Michael Armbrust] Make sure we throw errors when leaf nodes fail to provide statistcs 18ce029 [Michael Armbrust] Ensure in-memory tables don't always broadcast. (cherry picked from commit 7d2a7a91f263bb9fbf24dc4dbffde8fe5e2c7442) Signed-off-by: Michael Armbrust commit 0c03fb621e5b080f24863cfc17032bd828b65b99 Author: Patrick Wendell Date: 2014-08-27T22:48:00Z Revert "[maven-release-plugin] prepare for next development iteration" This reverts commit 9af3fb7385d1f9f221962f1d2d725ff79bd82033. commit 0b17c7d4f2176f0c0e8aaab95e034be54467ff30 Author: Patrick Wendell Date: 2014-08-27T22:48:13Z Revert "[maven-release-plugin] prepare release v1.1.0-snapshot2" This reverts commit e1535ad3c6f7400f2b7915ea91da9c60510557ba. 
commit d4cf7a068da099f0f07f04a834d7edf6b743ceb3 Author: Matthew Farrellee Date: 2014-08-27T22:50:30Z Add line continuation for script to work w/ py2.7.5 Error was - $ SPARK_HOME=$PWD/dist ./dev/create-release/generate-changelist.py File "./dev/create-release/generate-changelist.py", line 128 if day < SPARK_REPO_CHANGE_DATE1 or ^ SyntaxError: invalid syntax Author: Matthew Farrellee Closes #2139 from mattf/master-fix-generate-changelist.py-0 and squashes the following commits: 6b3a900 [Matthew Farrellee] Add line continuation for script to work w/ py2.7.5 (cherry picked from commit 64d8ecbbe94c47236ff2d8c94d7401636ba6fca4) Signed-off-by: Patrick Wendell commit 8597e9cf356b0d8e17600a49efc4c4a0356ecb5d Author: Patrick Wendell Date: 2014-08-27T22:55:59Z BUILD: Updating CHANGES.txt for Spark 1.1 commit 58b0be6a29eab817d350729710345e9f39e4c506 Author: Patrick Wendell Date: 2014-08-27T23:28:08Z [maven-release-plugin] prepare release v1.1.0-rc1 commit 78e3c036eee7113b2ed144eec5061e070b479e56 Author: Patrick Wendell Date: 2014-08-27T23:28:27Z [maven-release-plugin] prepare for next development iteration commit 54ccd93e621c1bc4afc709a208b609232ab701d1 Author: Andrew Or Date: 2014-08-28T06:03:46Z [HOTFIX] Wait for EOF only for the PySpark shell In `SparkSubmitDriverBootstrapper`, we wait for the parent process to send us an `EOF` before finishing the application. This is applicable for the PySpark shell because we terminate the application the same way. However if we run a python application, for instance, the JVM actually never exits unless it receives a manual EOF from the user. This is causing a few tests to timeout. We only need to do this for the PySpark shell because Spark submit runs as a python subprocess only in this case. Thus, the normal Spark shell doesn't need to go through this case even though it is also a REPL. Thanks davies for reporting this. 
Author: Andrew Or Closes #2170 from andrewor14/bootstrap-hotfix and squashes the following commits: 42963f5 [Andrew Or] Do not wait for EOF unless this is the pyspark shell (cherry picked from commit dafe343499bbc688e266106e4bb897f9e619834e) Signed-off-by: Patrick Wendell commit 233c283e3d946bdcbf418375122c5763559c0119 Author: Michael Armbrust Date: 2014-08-28T06:05:34Z [HOTFIX][SQL] Remove cleaning of UDFs It is not safe to run the closure cleaner on slaves. #2153 introduced this which broke all UDF execution on slaves. Will re-add cleaning of UDF closures in a follow-up PR. Author: Michael Armbrust Closes #2174 from marmbrus/fixUdfs and squashes the following commits: 55406de [Michael Armbrust] [HOTFIX] Remove cleaning of UDFs (cherry p