[GitHub] spark issue #11228: [SPARK-13356][Streaming]WebUI missing input information...
Github user jeanlyn commented on the issue: https://github.com/apache/spark/pull/11228 @tdas I have added a unit test to this PR; could you take some time to review?
[GitHub] spark issue #11228: [SPARK-13356][Streaming]WebUI missing input information...
Github user jeanlyn commented on the issue: https://github.com/apache/spark/pull/11228 @tdas OK, I will try to add a unit test in the next few days.
[GitHub] spark issue #11228: [SPARK-13356][Streaming]WebUI missing input information...
Github user jeanlyn commented on the issue: https://github.com/apache/spark/pull/11228 @vanzin Sorry for the late reply; I have resolved the conflicts.
[GitHub] spark pull request: [SPARK-14243][CORE][BACKPORT-1.6]update task m...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/12150#issuecomment-206053030 OK.
[GitHub] spark pull request: [SPARK-14243][CORE][BACKPORT-1.6]update task m...
Github user jeanlyn closed the pull request at: https://github.com/apache/spark/pull/12150
[GitHub] spark pull request: [SPARK-14243][CORE][BACKPORT-1.6]update task m...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/12150#issuecomment-205341839 retest this please.
[GitHub] spark pull request: [SPARK-14243][CORE]update task metrics when re...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/12091#issuecomment-205334309 @andrewor14 Sorry for the late reply; I have submitted a patch for 1.6: #12150
[GitHub] spark pull request: [SPARK-14243][CORE][BACKPORT-1.6]update task m...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/12150#issuecomment-205334579 /cc @andrewor14
[GitHub] spark pull request: [SPARK-14243][CORE][BACKPORT-1.6]update task m...
GitHub user jeanlyn opened a pull request: https://github.com/apache/spark/pull/12150 [SPARK-14243][CORE][BACKPORT-1.6] Update task metrics when removing blocks

## What changes were proposed in this pull request?

This patch updates `updatedBlockStatuses` when removing blocks, making sure `BlockManager` keeps `updatedBlockStatuses` correct.

## How was this patch tested?

test("updated block statuses") in BlockManagerSuite.scala

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jeanlyn/spark updataBlock1.6

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12150.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12150

commit ff151af2d2495f99c3c22a3f17c78f5ff4fd9402
Author: jeanlyn
Date: 2016-04-04T14:35:14Z

    add metrics when removing blocks
[GitHub] spark pull request: [SPARK-14243][CORE]update task metrics when re...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/12091#issuecomment-204006712 cc @andrewor14
[GitHub] spark pull request: update task metrics when removing blocks
GitHub user jeanlyn opened a pull request: https://github.com/apache/spark/pull/12091 update task metrics when removing blocks

## What changes were proposed in this pull request?

This PR uses `incUpdatedBlockStatuses` to update `updatedBlockStatuses` when removing blocks, making sure `BlockManager` keeps `updatedBlockStatuses` correct.

## How was this patch tested?

test("updated block statuses") in BlockManagerSuite.scala

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jeanlyn/spark updateBlock

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12091.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12091

commit a97d220a96ed1f511f617f0f98663b82f2fd26ed
Author: jeanlyn
Date: 2016-03-30T16:01:20Z

    update metrics when removing blocks
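A minimal sketch of the change this PR describes, assuming the 2.0-era `TaskContext`/`TaskMetrics` API (`incUpdatedBlockStatuses` taking a `Seq`, as quoted later in this thread); the actual call sites inside `BlockManager` may differ:

```scala
import org.apache.spark.TaskContext
import org.apache.spark.storage.{BlockId, BlockStatus, StorageLevel}

// Sketch: when a block is removed, report a status with StorageLevel.NONE
// into the running task's metrics, so task-end listeners observe the removal.
def reportRemovedBlock(blockId: BlockId): Unit = {
  Option(TaskContext.get()).foreach { c =>
    // StorageLevel.NONE with zero memory/disk sizes marks the block as gone.
    val removedStatus = BlockStatus(StorageLevel.NONE, 0L, 0L)
    c.taskMetrics().incUpdatedBlockStatuses(Seq((blockId, removedStatus)))
  }
}
```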
[GitHub] spark pull request: [SPARK-13845][CORE][Backport-1.6]Using onBlock...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/12028#issuecomment-203173603 OK.
[GitHub] spark pull request: [SPARK-13845][CORE][Backport-1.6]Using onBlock...
Github user jeanlyn closed the pull request at: https://github.com/apache/spark/pull/12028
[GitHub] spark pull request: [SPARK-13845][CORE][Backport-1.6]Using onBlock...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/12028#issuecomment-202777494 /cc @andrewor14
[GitHub] spark pull request: [SPARK-13845][CORE][Backport-1.6]Using onBlock...
GitHub user jeanlyn opened a pull request: https://github.com/apache/spark/pull/12028 [SPARK-13845][CORE][Backport-1.6] Using onBlockUpdated to replace onTaskEnd, avoiding driver OOM

## What changes were proposed in this pull request?

We have a streaming job using `FlumePollInputStream` whose driver always OOMs after a few days. Here is part of a driver heap dump taken before the OOM:
```
 num     #instances         #bytes  class name
----------------------------------------------
   1:      13845916      553836640  org.apache.spark.storage.BlockStatus
   2:      14020324      336487776  org.apache.spark.storage.StreamBlockId
   3:      13883881      333213144  scala.collection.mutable.DefaultEntry
   4:          8907       89043952  [Lscala.collection.mutable.HashEntry;
   5:         62360       65107352  [B
   6:        163368       24453904  [Ljava.lang.Object;
   7:        293651       20342664  [C
...
```
`BlockStatus` and `StreamBlockId` keep growing, and the driver eventually OOMs. After investigating, I found that `executorIdToStorageStatus` in `StorageStatusListener` never seems to remove blocks from `StorageStatus`. To fix the issue, this patch uses `onBlockUpdated` to replace `onTaskEnd`, so block information (adding blocks, dropping a block from memory to disk, and deleting blocks) is updated in time.

## How was this patch tested?

Existing unit tests and manual tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jeanlyn/spark fixoom1.6

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12028.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12028

commit f37f3594b71df0083880aa5395a77808fd6c72b3
Author: jeanlyn
Date: 2016-03-29T05:06:03Z

    Using onBlockUpdated to replace onTaskEnd avioding driver OOM
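A minimal sketch of the listener-side idea described above, assuming the `SparkListenerBlockUpdated`/`BlockUpdatedInfo` event shape of that era; the class and field names here are illustrative, not a copy of the actual patch:

```scala
import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockUpdated}
import org.apache.spark.storage.BlockId

// Sketch: react to every block update instead of waiting for onTaskEnd, so
// removed stream blocks stop accumulating in driver-side bookkeeping.
class BlockTrackingListener extends SparkListener {
  // executor id -> blocks currently reported as stored on that executor
  private val blocksPerExecutor = mutable.Map[String, mutable.Set[BlockId]]()

  override def onBlockUpdated(event: SparkListenerBlockUpdated): Unit = {
    val info = event.blockUpdatedInfo
    val blocks = blocksPerExecutor.getOrElseUpdate(
      info.blockManagerId.executorId, mutable.Set.empty[BlockId])
    if (info.storageLevel.isValid) {
      blocks += info.blockId // added, or moved between memory and disk
    } else {
      blocks -= info.blockId // invalid level (NONE): the block was removed
    }
  }
}
```

Keying the map by executor id mirrors how per-executor storage status is tracked, so entries can also be dropped wholesale when an executor is removed.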
[GitHub] spark pull request: [SPARK-13845][CORE]Using onBlockUpdated to rep...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/11679#issuecomment-202653414 OK, I will file a JIRA later.
[GitHub] spark pull request: [SPARK-13845][CORE]Using onBlockUpdated to rep...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/11779#issuecomment-202649212 Sure. I will submit a patch against branch-1.6.
[GitHub] spark pull request: [SPARK-13845][CORE]Using onBlockUpdated to rep...
GitHub user jeanlyn opened a pull request: https://github.com/apache/spark/pull/11779 [SPARK-13845][CORE] Using onBlockUpdated to replace onTaskEnd, avoiding driver OOM

## What changes were proposed in this pull request?

We have a streaming job using `FlumePollInputStream` whose driver always OOMs after a few days. Here is part of a driver heap dump taken before the OOM:
```
 num     #instances         #bytes  class name
----------------------------------------------
   1:      13845916      553836640  org.apache.spark.storage.BlockStatus
   2:      14020324      336487776  org.apache.spark.storage.StreamBlockId
   3:      13883881      333213144  scala.collection.mutable.DefaultEntry
   4:          8907       89043952  [Lscala.collection.mutable.HashEntry;
   5:         62360       65107352  [B
   6:        163368       24453904  [Ljava.lang.Object;
   7:        293651       20342664  [C
...
```
`BlockStatus` and `StreamBlockId` keep growing, and the driver eventually OOMs. After investigating, I found that `executorIdToStorageStatus` in `StorageStatusListener` never seems to remove blocks from `StorageStatus`. To fix the issue, this patch uses `onBlockUpdated` to replace `onTaskEnd`, so block information (adding blocks, dropping a block from memory to disk, and deleting blocks) is updated in time.

## How was this patch tested?

Existing unit tests and manual tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jeanlyn/spark fix_driver_oom

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11779.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11779

commit f1f6df685b416b1f7fa47f9419dc5a0300e63070
Author: jeanlyn
Date: 2016-03-12T08:01:39Z

    using onBlockUpdated to replace onTaskEnd when the block updated

commit 5255067c40b49efab597dfbb13ccfa9b6f7b0833
Author: jeanlyn
Date: 2016-03-12T12:52:04Z

    fix unit test

commit ab4a863b71ee28791b43daea1af58453906af8c0
Author: jeanlyn
Date: 2016-03-15T01:01:53Z

    fix scala style

commit 59150927dfe7c1eacd244a359fc5e8d040d4dc07
Author: jeanlyn
Date: 2016-03-17T06:24:01Z

    fix HistoryServerSuite

commit 2e2c319cd0dafe3ee8ae8d4fb8fe958ae5ddcd4e
Author: jeanlyn
Date: 2016-03-17T09:16:39Z

    fix typos
[GitHub] spark pull request: [SPARK-13845][CORE]Using onBlockUpdated to rep...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/11779#issuecomment-197786170 This PR is the same as #11679, but I ran into some accidents when rebasing that PR, so I created a new one. /cc @andrewor14
[GitHub] spark pull request: [SPARK-13845][CORE]Using onBlockUpdated to rep...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/11679#issuecomment-197674341 All the test failures are related to `HistoryServerSuite`. The reason is that we removed `onTaskEnd`, which was used to replay the storage page of the history server from the event log, and it seems we do not want to log `onBlockUpdated` events to the file. See: https://github.com/apache/spark/blob/184085284185011d7cc6d054b54d2d38eaf1dd77/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L196 Hence, is it acceptable to regenerate the `*_expectation.json` files in `src/test/resources/HistoryServerExpectations/` to pass the unit tests for now? @andrewor14
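For context, the linked `EventLoggingListener` handler for block updates is, in the versions discussed here, an intentional no-op, which is why a replayed event log cannot rebuild the storage page once `onTaskEnd` no longer carries block statuses; a sketch of that shape:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockUpdated}

// Sketch of the event-log behavior being discussed: block updates are
// high-volume, so they are deliberately never written to the event log,
// and the history server never sees them on replay.
class EventLogShapedListener extends SparkListener {
  override def onBlockUpdated(event: SparkListenerBlockUpdated): Unit = {
    // No-op: block updates are not logged.
  }
}
```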
[GitHub] spark pull request: [SPARK-13845][CORE]Using onBlockUpdated to rep...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/11679#issuecomment-197778717 Closing this due to the rebase accident.
[GitHub] spark pull request: [SPARK-13845][CORE]Using onBlockUpdated to rep...
Github user jeanlyn closed the pull request at: https://github.com/apache/spark/pull/11679
[GitHub] spark pull request: [SPARK-13845][CORE]Using onBlockUpdated to rep...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/11679#issuecomment-197127323 I think the problem is that the `updatedBlockStatuses` metric is not updated, with code like
```
c.taskMetrics().incUpdatedBlockStatuses(Seq((blockId, status)))
```
in `BlockManager.removeBlock` and in the methods that invoke `removeBlock` when removing blocks. See:
master: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1226-1226
branch-1.6: https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1108
branch-1.5: https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1101
So when the task ends, we cannot update `executorIdToStorageStatus` in `StorageStatusListener` correctly. Let me know if my understanding is wrong; forgive me for not explaining the problem clearly before.
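A sketch of the task-end path this comment describes, assuming the master-branch `TaskMetrics` API of the time (`updatedBlockStatuses`; branch-1.5/1.6 expose the same data as `updatedBlocks`). It shows why a removal that never reaches the task's metrics can never be cleaned out of `executorIdToStorageStatus`:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sketch: StorageStatusListener-style cleanup driven purely by task metrics.
// If BlockManager.removeBlock never records the removal in the task's
// updatedBlockStatuses, this handler has nothing to act on, and stale
// BlockStatus entries accumulate until the driver OOMs.
class TaskEndCleanupListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      metrics.updatedBlockStatuses.foreach { case (blockId, status) =>
        if (!status.storageLevel.isValid) {
          // An invalid storage level means the block is gone: drop blockId
          // from the per-executor StorageStatus here.
        }
      }
    }
  }
}
```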
[GitHub] spark pull request: [SPARK-13845][CORE]Using onBlockUpdated to rep...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/11679#issuecomment-196617986 @andrewor14 It seems the MiMA failure is not related to this patch. Do I need to fix it in this patch?
[GitHub] spark pull request: [SPARK-13845][CORE]Using onBlockUpdated to rep...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/11679#issuecomment-196591027 Thanks @andrewor14 for the review. We encountered this issue on branch-1.5, and I have noticed the recent changes to metrics. If I understand correctly, the root cause of this issue is that `metrics.updatedBlockStatuses` is not updated in `BlockManager.removeBlock` when removing blocks, but I prefer to use `onBlockUpdated` to solve this issue.
[GitHub] spark pull request: [CORE]Using onBlockUpdated to replace onTaskEn...
GitHub user jeanlyn opened a pull request: https://github.com/apache/spark/pull/11679 [CORE] Using onBlockUpdated to replace onTaskEnd, avoiding driver OOM

## What changes were proposed in this pull request?

We have a streaming job using `FlumePollInputStream` whose driver always OOMs after a few days. Here is part of a driver heap dump taken before the OOM:
```
 num     #instances         #bytes  class name
----------------------------------------------
   1:      13845916      553836640  org.apache.spark.storage.BlockStatus
   2:      14020324      336487776  org.apache.spark.storage.StreamBlockId
   3:      13883881      333213144  scala.collection.mutable.DefaultEntry
   4:          8907       89043952  [Lscala.collection.mutable.HashEntry;
   5:         62360       65107352  [B
   6:        163368       24453904  [Ljava.lang.Object;
   7:        293651       20342664  [C
```
`BlockStatus` and `StreamBlockId` keep growing, and the driver eventually OOMs. After investigating, I found that `executorIdToStorageStatus` in `StorageStatusListener` never seems to remove blocks from `StorageStatus`. To fix the issue, this patch uses `onBlockUpdated` to replace `onTaskEnd`, so block information (adding blocks, dropping a block from memory to disk, and deleting blocks) is updated in time.

## How was this patch tested?

Existing unit tests and manual tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jeanlyn/spark fix_driver_oom

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11679.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11679

commit f1f6df685b416b1f7fa47f9419dc5a0300e63070
Author: jeanlyn
Date: 2016-03-12T08:01:39Z

    using onBlockUpdated to replace onTaskEnd when the block updated

commit 5255067c40b49efab597dfbb13ccfa9b6f7b0833
Author: jeanlyn
Date: 2016-03-12T12:52:04Z

    fix unit test
[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...
Github user jeanlyn closed the pull request at: https://github.com/apache/spark/pull/11440
[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/11440#issuecomment-190994425 Thanks @jerryshao @srowen @zsxwing for the suggestions. I am closing this PR.
[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/11440#issuecomment-190613342 My bad. I will try to figure out a way to fix the behavior when window operations are used with the config set to true.
[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/11440#issuecomment-190608101 @jerryshao Thanks for the explanation; I see what you mean. It only happens at the beginning, and if the stop time is much longer than the window time, I think it is acceptable to skip those down-time batches.
[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/11440#issuecomment-190568465 Thanks @jerryshao for the suggestions!
> Jobs generated in the down time can be used for WAL replay, did you test when these down jobs are removed, the behavior of WAL replay is still correct?

It seems that `pendingTimes` is used for WAL replay; I do not skip those batches.
> Also for some windowing operations, I think this removal of down time jobs may possibly lead to the inconsistent result of windowing aggregation.

Does an inconsistent result mean a wrong result? Also, I will run the unit tests with the config set to true by default on my local machine.
[GitHub] spark pull request: [SPARK-13586]add config to skip generate down ...
GitHub user jeanlyn opened a pull request: https://github.com/apache/spark/pull/11440 [SPARK-13586] Add config to skip generating down-time batches when restarting StreamingContext

## What changes were proposed in this pull request?

This patch adds a config, `spark.streaming.skipDownTimeBatch`, to control whether to generate the down-time batches when restarting a StreamingContext. By default, it is set to false.

## How was this patch tested?

unit test: test("SPARK-13586: do no generate down time batch when recovering from checkpoint")

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jeanlyn/spark skipDownTime

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11440.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11440

commit 089d0af74317b767378c16673ac9d67f6dfd9972
Author: jeanlyn
Date: 2016-02-29T07:15:38Z

    add config to generate down time batch

commit 9068881aeb17bb77383b3e9eecf01c463f62113c
Author: jeanlyn
Date: 2016-03-01T03:21:23Z

    add jira num
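A hedged usage sketch of the proposed flag; the configuration key comes from this PR, which was later closed unmerged, so it does not exist in released Spark:

```scala
import org.apache.spark.SparkConf

// Sketch only: spark.streaming.skipDownTimeBatch exists only in the proposed
// patch. The proposed default is false; true would skip generating batches
// for the downtime window when recovering a StreamingContext from checkpoint.
val conf = new SparkConf()
  .setAppName("checkpoint-recovery-demo")
  .set("spark.streaming.skipDownTimeBatch", "true")
```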
[GitHub] spark pull request: [SPARK-13356][Streaming]WebUI missing input in...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/11228#issuecomment-185001455 @JoshRosen Sure.
[GitHub] spark pull request: [SPARK-13356][Streaming]WebUI missing input in...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/11228#issuecomment-184981662 @tdas @zsxwing
[GitHub] spark pull request: [SPARK-13356][Streaming]WebUI missing input in...
GitHub user jeanlyn opened a pull request: https://github.com/apache/spark/pull/11228 [SPARK-13356][Streaming] WebUI missing input information when recovering from driver failure

Issue link: [SPARK-13356](https://issues.apache.org/jira/browse/SPARK-13356)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jeanlyn/spark checkpoint

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11228.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11228

commit c858d8398d22b158f0e36bdeeefb2d6941efd6da
Author: jeanlyn
Date: 2016-02-16T15:06:13Z

    report info when restarting streamingContext
[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-169563822 It is different from join selection; it just pulls out non-deterministic expressions from the join condition into the left or right children, but it seems it can reuse the code of `ExtractEquiJoinKeys`.
[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-169524471 @marmbrus You are right, but I think @zhonghaihua's solution tries to reduce the possibility of a Cartesian product, right?
[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/10128#issuecomment-161851642 @cloud-fan I think your case is different from @zhonghaihua's. The SQL only deals with certain join keys ('' and null) before the shuffle, to handle those pointless keys that cause skew in the join operator, while `repartition` deals with all the data before some map operator. These are two different types of data skew, right?
[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/7927#issuecomment-137046713 It seems that the failure is not related.
[GitHub] spark pull request: [SPARK-9192][SQL] add initialization phase for...
Github user jeanlyn commented on a diff in the pull request: https://github.com/apache/spark/pull/7535#discussion_r38332979

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
```
@@ -129,6 +128,14 @@ trait CheckAnalysis {
         failAnalysis(
           s"unresolved operator ${operator.simpleString}")

+      case o if o.expressions.exists(!_.deterministic) &&
+        !o.isInstanceOf[Project] && !o.isInstanceOf[Filter] =>
+        failAnalysis(
```
--- End diff --

Yep, we would need to change this SQL manually after this behavior change.
[GitHub] spark pull request: [SPARK-9192][SQL] add initialization phase for...
Github user jeanlyn commented on a diff in the pull request: https://github.com/apache/spark/pull/7535#discussion_r38289221

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
```
@@ -129,6 +128,14 @@ trait CheckAnalysis {
         failAnalysis(
           s"unresolved operator ${operator.simpleString}")

+      case o if o.expressions.exists(!_.deterministic) &&
+        !o.isInstanceOf[Project] && !o.isInstanceOf[Filter] =>
+        failAnalysis(
```
--- End diff --

Hi, @cloud-fan. Can it support the join operation? Sometimes we can use a non-deterministic expression to evaluate pointless join keys (with respect to business logic), avoiding data skew. For example:
```sql
SELECT src.key, src.value, src1.value
FROM src
JOIN src1
ON UPPER((CASE WHEN (src.key IS NULL OR src.key = '')
          THEN CAST((-RAND() * 1000) AS string)
          ELSE src.key END)) = UPPER(src1.key)
```
What do you think?
[GitHub] spark pull request: [SPARK-10198][SQL] Turn off partition verifica...
Github user jeanlyn commented on a diff in the pull request: https://github.com/apache/spark/pull/8404#discussion_r37867722

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
```
@@ -18,50 +18,54 @@ package org.apache.spark.sql.hive

 import com.google.common.io.Files
+import org.apache.spark.sql.test.SQLTestUtils
 import org.apache.spark.sql.{QueryTest, _}
 import org.apache.spark.util.Utils

-class QueryPartitionSuite extends QueryTest {
+class QueryPartitionSuite extends QueryTest with SQLTestUtils {
   private lazy val ctx = org.apache.spark.sql.hive.test.TestHive
   import ctx.implicits._
-  import ctx.sql
+
+  protected def _sqlContext = ctx

   test("SPARK-5068: query data when path doesn't exist"){
-    val testData = ctx.sparkContext.parallelize(
-      (1 to 10).map(i => TestData(i, i.toString))).toDF()
-    testData.registerTempTable("testData")
+    withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
```
--- End diff --

Should this be set to `true`?
[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...
Github user jeanlyn commented on a diff in the pull request: https://github.com/apache/spark/pull/7927#discussion_r36587468

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
```
@@ -590,10 +590,24 @@ private[spark] class BlockManager(
   private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): Option[Any] = {
     require(blockId != null, "BlockId is null")
     val locations = Random.shuffle(master.getLocations(blockId))
+    var attemptTimes = 0
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
+        blockTransferService.fetchBlockSync(
+          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      } catch {
+        case t: Throwable if attemptTimes < locations.size - 1 =>
+          // Return null when an exception is thrown, so we can fetch the block
+          // from another location if there are still locations left
+          attemptTimes += 1
+          logWarning(s"Try $attemptTimes times getting remote block $blockId from $loc failed.", t)
+          null
+        case t: Throwable =>
+          // Throw BlockFetchException wrapping the last exception when
+          // there is no location left to fetch the block from
+          throw new BlockFetchException(t)
```
--- End diff --

Sure!
[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/7927#issuecomment-128582275 Thanks everyone for the review! I updated the code; now `doGetRemote` will tolerate an exception when we still have locations from which to fetch the block, avoiding breaking the work flow, and a `BlockFetchException` will be thrown when there is no location left (since this is `doGetRemote`'s behavior, I use a new exception that wraps the last one; any suggestions?).
[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...
Github user jeanlyn commented on a diff in the pull request: https://github.com/apache/spark/pull/7927#discussion_r36302564

--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
```
@@ -443,6 +448,34 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
     assert(list2DiskGet.get.readMethod === DataReadMethod.Disk)
   }

+  test("(SPARK-9591)getRemoteBytes from another location when IOException throw") {
+    try {
+      conf.set("spark.network.timeout", "2s")
+      store = makeBlockManager(8000, "executor1")
+      store2 = makeBlockManager(8000, "executor2")
+      store3 = makeBlockManager(8000, "executor3")
+      val list1 = List(new Array[Byte](4000))
+      store2.putIterator("list1", list1.iterator, StorageLevel.MEMORY_ONLY, tellMaster = true)
+      store3.putIterator("list1", list1.iterator, StorageLevel.MEMORY_ONLY, tellMaster = true)
+      var list1Get = store.getRemoteBytes("list1")
+      assert(list1Get.isDefined, "list1Get expected to be fetched")
+      // block manager exit
+      store2.stop()
+      store2 = null
+      list1Get = store.getRemoteBytes("list1")
+      // get `list1` block
+      assert(list1Get.isDefined, "list1Get expected to be fetched")
+      store3.stop()
+      store3 = null
+      // exception thrown because there are no locations left
+      intercept[java.io.IOException] {
+        list1Get = store.getRemoteBytes("list1")
+      }
```
--- End diff --

Yes, the test passes on my local laptop. I only catch `IOException` when there are still locations from which we can fetch the block.
[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...
Github user jeanlyn commented on a diff in the pull request: https://github.com/apache/spark/pull/7927#discussion_r36294518

--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
```
@@ -443,6 +448,37 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
     assert(list2DiskGet.get.readMethod === DataReadMethod.Disk)
   }

+  test("(SPARK-9591)getRemoteBytes from another location when IOException throw") {
+    try {
+      conf.set("spark.network.timeout", "2s")
+      store = makeBlockManager(8000, "excutor1")
```
--- End diff --

My bad.
[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...
Github user jeanlyn commented on a diff in the pull request: https://github.com/apache/spark/pull/7927#discussion_r36217402

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
```
@@ -592,8 +592,14 @@ private[spark] class BlockManager(
     val locations = Random.shuffle(master.getLocations(blockId))
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
+        blockTransferService.fetchBlockSync(
+          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      } catch {
+        case e: Throwable =>
+          logWarning(s"Exception during getting remote block $blockId from $loc", e)
```
--- End diff --

Makes sense; I will fix it later. Thanks a lot, @squito!
[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...
Github user jeanlyn commented on a diff in the pull request: https://github.com/apache/spark/pull/7927#discussion_r36182323

--- Diff: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
```
@@ -443,6 +443,21 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
     assert(list2DiskGet.get.readMethod === DataReadMethod.Disk)
   }

+  test("block manager crash test") {
+    conf.set("spark.network.timeout", "5s")
+    store = makeBlockManager(8000)
+    store2 = makeBlockManager(8000)
+    val list1 = List(new Array[Byte](4000))
+    store2.putIterator("list1", list1.iterator, StorageLevel.MEMORY_ONLY, tellMaster = true)
+    val list1get = store.getRemote("list1")
```
--- End diff --

Thanks, I will fix it.
[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...
Github user jeanlyn commented on a diff in the pull request: https://github.com/apache/spark/pull/7927#discussion_r36182313

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
```
@@ -592,8 +592,14 @@ private[spark] class BlockManager(
     val locations = Random.shuffle(master.getLocations(blockId))
     for (loc <- locations) {
       logDebug(s"Getting remote block $blockId from $loc")
-      val data = blockTransferService.fetchBlockSync(
-        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      val data = try {
+        blockTransferService.fetchBlockSync(
+          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+      } catch {
+        case e: Throwable =>
+          logWarning(s"Exception during getting remote block $blockId from $loc", e)
```
--- End diff --

Thanks @srowen and @CodingCat for the comments!
* If I understand correctly, `doGetRemote` here will return `None` when fetching the block returned null from every location, and all the methods that call `doGetRemote` handle the `None` case and throw an exception when necessary, so I think it is safe to catch the exception here.
* `fetchBlockSync` just calls `fetchBlocks` to fetch the block, so I think catching the exception here is equivalent.
[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...
GitHub user jeanlyn opened a pull request: https://github.com/apache/spark/pull/7927 [SPARK-9591][CORE] Job may fail for exception during getting broadcast variable

[SPARK-9591](https://issues.apache.org/jira/browse/SPARK-9591)

When getting a broadcast variable, we can fetch the block from several locations, but currently connecting to a lost block manager (for example, one idle long enough to be removed by the driver when using dynamic resource allocation, and so on) will cause the task to fail, and in the worst case will cause the job to fail.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jeanlyn/spark catch_exception

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7927.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7927

commit f8340a2ec880da41a36e3ae8bf87c5ac5badc919
Author: jeanlyn
Date: 2015-08-04T09:23:29Z

    catch exception avoid task fail
[GitHub] spark pull request: [SPARK-7165] [SQL] use sort merge join for out...
Github user jeanlyn commented on a diff in the pull request: https://github.com/apache/spark/pull/5717#discussion_r32891298

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoin.scala ---
```
@@ -82,86 +130,169 @@ case class SortMergeJoin(

     override final def next(): InternalRow = {
       if (hasNext) {
-        // we are using the buffered right rows and run down left iterator
-        val joinedRow = joinRow(leftElement, rightMatches(rightPosition))
-        rightPosition += 1
-        if (rightPosition >= rightMatches.size) {
-          rightPosition = 0
-          fetchLeft()
-          if (leftElement == null || keyOrdering.compare(leftKey, matchKey) != 0) {
-            stop = false
-            rightMatches = null
+        if (bufferedMatches == null || bufferedMatches.size == 0) {
+          // we just found a row with no join match and we are here to produce a row
+          // with this row and a standard null row from the other side.
+          if (continueStreamed) {
+            val joinedRow = smartJoinRow(streamedElement, bufferedNullRow)
+            fetchStreamed()
+            joinedRow
+          } else {
+            val joinedRow = smartJoinRow(streamedNullRow, bufferedElement)
+            fetchBuffered()
+            joinedRow
+          }
+        } else {
+          // we are using the buffered right rows and run down left iterator
+          val joinedRow = smartJoinRow(streamedElement, bufferedMatches(bufferedPosition))
+          bufferedPosition += 1
+          if (bufferedPosition >= bufferedMatches.size) {
+            bufferedPosition = 0
+            if (joinType != FullOuter || secondStreamedElement == null) {
+              fetchStreamed()
+              if (streamedElement == null || keyOrdering.compare(streamedKey, matchKey) != 0) {
+                stop = false
+                bufferedMatches = null
+              }
+            } else {
+              // in FullOuter join and the first time we finish the match buffer,
+              // we still want to generate all rows with streamed null row and buffered
+              // rows that match the join key but not the conditions.
+              streamedElement = secondStreamedElement
+              bufferedMatches = secondBufferedMatches
+              secondStreamedElement = null
+              secondBufferedMatches = null
+            }
+          }
+          joinedRow
+        }
-        joinedRow
       } else {
         // no more result
         throw new NoSuchElementException
       }
     }

-    private def fetchLeft() = {
-      if (leftIter.hasNext) {
-        leftElement = leftIter.next()
-        leftKey = leftKeyGenerator(leftElement)
+    private def smartJoinRow(streamedRow: InternalRow, bufferedRow: InternalRow): InternalRow =
+      joinType match {
+        case RightOuter => joinRow(bufferedRow, streamedRow)
+        case _ => joinRow(streamedRow, bufferedRow)
+      }
+
+    private def fetchStreamed(): Unit = {
+      if (streamedIter.hasNext) {
+        streamedElement = streamedIter.next()
+        streamedKey = streamedKeyGenerator(streamedElement)
       } else {
-        leftElement = null
+        streamedElement = null
       }
     }

-    private def fetchRight() = {
-      if (rightIter.hasNext) {
-        rightElement = rightIter.next()
-        rightKey = rightKeyGenerator(rightElement)
+    private def fetchBuffered(): Unit = {
+      if (bufferedIter.hasNext) {
+        bufferedElement = bufferedIter.next()
+        bufferedKey = bufferedKeyGenerator(bufferedElement)
       } else {
-        rightElement = null
+        bufferedElement = null
      }
    }

     private def initialize() = {
-      fetchLeft()
-      fetchRight()
+      fetchStreamed()
+      fetchBuffered()
     }

     /**
      * Searches the right iterator for the next rows that have matches in left side, and store
      * them in a buffer.
+     * When this is not a Inner join, we will also return true when we get a row with no match
+     * on the other side. This search will jump out every time from the same position until
+     * `next(
```
[GitHub] spark pull request: [SPARK-7165] [SQL] use sort merge join for out...
Github user jeanlyn commented on a diff in the pull request: https://github.com/apache/spark/pull/5717#discussion_r32891269

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoin.scala ---

@@ -82,86 +130,169 @@ case class SortMergeJoin(
     override final def next(): InternalRow = {
       if (hasNext) {
-        // we are using the buffered right rows and run down left iterator
-        val joinedRow = joinRow(leftElement, rightMatches(rightPosition))
-        rightPosition += 1
-        if (rightPosition >= rightMatches.size) {
-          rightPosition = 0
-          fetchLeft()
-          if (leftElement == null || keyOrdering.compare(leftKey, matchKey) != 0) {
-            stop = false
-            rightMatches = null
+        if (bufferedMatches == null || bufferedMatches.size == 0) {
+          // we just found a row with no join match and we are here to produce a row
+          // with this row and a standard null row from the other side.
+          if (continueStreamed) {
+            val joinedRow = smartJoinRow(streamedElement, bufferedNullRow)
+            fetchStreamed()
+            joinedRow
+          } else {
+            val joinedRow = smartJoinRow(streamedNullRow, bufferedElement)
+            fetchBuffered()
+            joinedRow
+          }
+        } else {
+          // we are using the buffered right rows and run down left iterator
+          val joinedRow = smartJoinRow(streamedElement, bufferedMatches(bufferedPosition))
+          bufferedPosition += 1
+          if (bufferedPosition >= bufferedMatches.size) {
+            bufferedPosition = 0
+            if (joinType != FullOuter || secondStreamedElement == null) {
+              fetchStreamed()

--- End diff --

I think we should use `boundCondition` to update `bufferedMatches` after we `fetchStreamed()`. Otherwise we may get a wrong answer. For example:

```sql
table a(key int, value int); table b(key int, value int)

data of a
1 3
1 1
2 1
2 3

data of b
1 1
2 1

select a.key, b.key, a.value - b.value
from a left outer join b
on a.key = b.key and a.value - b.value > 1
```
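Working through that example by hand (my own derivation, not taken from the PR): rows of `a` whose key matches some `b` row but whose condition `a.value - b.value > 1` fails must still be emitted once with nulls, which is why the match buffer needs re-filtering with `boundCondition` after each `fetchStreamed()`:

```scala
val a = Seq((1, 3), (1, 1), (2, 1), (2, 3))
val b = Seq((1, 1), (2, 1))

// Expected left-outer result under the join condition
// a.key = b.key AND a.value - b.value > 1 (each b key appears once here,
// so collectFirst suffices for this tiny example):
val expected = a.map { case (ak, av) =>
  b.collectFirst { case (bk, bv) if bk == ak && av - bv > 1 => (ak, Some(bk), Some(av - bv)) }
    .getOrElse((ak, None, None))
}
// => List((1,Some(1),Some(2)), (1,None,None), (2,None,None), (2,Some(2),Some(2)))
```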
[GitHub] spark pull request: [SPARK-7165] [SQL] use sort merge join for out...
Github user jeanlyn commented on a diff in the pull request: https://github.com/apache/spark/pull/5717#discussion_r32891271

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---

@@ -90,13 +90,12 @@ private[sql] abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
           left.statistics.sizeInBytes <= sqlContext.conf.autoBroadcastJoinThreshold =>
         makeBroadcastHashJoin(leftKeys, rightKeys, left, right, condition, joins.BuildLeft)

-      // If the sort merge join option is set, we want to use sort merge join prior to hashjoin
-      // for now let's support inner join first, then add outer join
-      case ExtractEquiJoinKeys(Inner, leftKeys, rightKeys, condition, left, right)
+      // If the sort merge join option is set, we want to use sort merge join prior to hashjoin.
+      // And for outer join, we can not put conditions outside of the join
+      case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)
         if sqlContext.conf.sortMergeJoinEnabled =>
-        val mergeJoin =
-          joins.SortMergeJoin(leftKeys, rightKeys, planLater(left), planLater(right))
-        condition.map(Filter(_, mergeJoin)).getOrElse(mergeJoin) :: Nil
+        joins.SortMergeJoin(
+          leftKeys, rightKeys, joinType, planLater(left), planLater(right), condition) :: Nil

--- End diff --

Shall we move this code into a new `Strategy` (like a dedicated SortMergeJoin strategy) instead of mixing it into `HashJoin`?
[GitHub] spark pull request: [SPARK-8379][SQL]avoid speculative tasks write...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/6833#issuecomment-112636409 @chenghao-intel, I think it only affects dynamic partitions, because `SparkHadoopWriter` gets the writer via `OutputFormat.getRecordWriter`, and most implementations use `FileOutputFormat.getTaskOutputPath` to get the path.
[GitHub] spark pull request: [SPARK-8379][SQL]avoid speculative tasks write...
Github user jeanlyn commented on a diff in the pull request: https://github.com/apache/spark/pull/6833#discussion_r32592438

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala ---

@@ -230,7 +230,15 @@ private[spark] class SparkHiveDynamicPartitionWriterContainer(
       val path = {
         val outputPath = FileOutputFormat.getOutputPath(conf.value)

--- End diff --

Oh, I will try it later.
[GitHub] spark pull request: [SPARK-8379][SQL]avoid speculative tasks write...
Github user jeanlyn commented on a diff in the pull request: https://github.com/apache/spark/pull/6833#discussion_r32492419

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---

@@ -197,7 +197,6 @@ case class InsertIntoHiveTable(
       table.hiveQlTable.getPartCols().foreach { entry =>
         orderedPartitionSpec.put(entry.getName, partitionSpec.get(entry.getName).getOrElse(""))
       }
-      val partVals = MetaStoreUtils.getPvals(table.hiveQlTable.getPartCols, partitionSpec)

--- End diff --

I think https://github.com/apache/spark/pull/5876/files#diff-d579db9a8f27e0bbef37720ab14ec3f6L203 should have removed this code. @marmbrus, right?
[GitHub] spark pull request: [SPARK-8379][SQL]avoid speculative tasks write...
Github user jeanlyn commented on a diff in the pull request: https://github.com/apache/spark/pull/6833#discussion_r32491951

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---

@@ -197,7 +197,6 @@ case class InsertIntoHiveTable(
       table.hiveQlTable.getPartCols().foreach { entry =>
         orderedPartitionSpec.put(entry.getName, partitionSpec.get(entry.getName).getOrElse(""))
       }
-      val partVals = MetaStoreUtils.getPvals(table.hiveQlTable.getPartCols, partitionSpec)

--- End diff --

This code never seems to be used, so remove it.
[GitHub] spark pull request: [SPARK-8379][SQL]avoid speculative tasks write...
GitHub user jeanlyn opened a pull request: https://github.com/apache/spark/pull/6833 [SPARK-8379][SQL]avoid speculative tasks write to the same file The issue link: [SPARK-8379](https://issues.apache.org/jira/browse/SPARK-8379) Currently, when we insert data into a dynamic partition with speculative tasks, we get this exception:

```
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): Lease mismatch on /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-1/ds=2015-06-15/type=2/part-00301.lzo owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53 but is accessed by DFSClient_attempt_201506031520_0011_m_42_0_-1275047721_57
```

This PR tries to write the data to a temporary dir when using dynamic partitions, so that speculative tasks do not write to the same file. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jeanlyn/spark speculation Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/6833.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #6833 commit e19a3bd77b6b9f44479e51659e244e9809b2963d Author: jeanlyn Date: 2015-06-15T16:38:16Z avoid speculative tasks write same file
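A minimal sketch of the per-attempt staging idea (hypothetical paths and helper names, not the actual writer-container code): each task attempt writes under its own temporary directory, so two speculative attempts of one task can never collide on the same output file, and only the committed attempt is renamed into place.

```scala
import java.nio.file.{Files, Path, Paths, StandardCopyOption}

// Every attempt gets a private staging directory keyed by (task, attempt).
def attemptStagingDir(outputPath: String, taskId: Int, attemptId: Int): Path =
  Paths.get(outputPath, "_temporary", s"task-$taskId-attempt-$attemptId")

// On commit, atomically rename the winning attempt's directory into the final
// location; a losing speculative attempt's staging dir is simply discarded
// when _temporary is cleaned up.
def commitAttempt(outputPath: String, taskId: Int, attemptId: Int): Unit = {
  val staged = attemptStagingDir(outputPath, taskId, attemptId)
  val dest = Paths.get(outputPath, f"part-$taskId%05d")
  Files.move(staged, dest, StandardCopyOption.ATOMIC_MOVE)
}
```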
[GitHub] spark pull request: [SPARK-2205][SPARK-7871][SQL]Advoid redundancy...
Github user jeanlyn closed the pull request at: https://github.com/apache/spark/pull/6682
[GitHub] spark pull request: [SPARK-2205][SPARK-7871][SQL]Advoid redundancy...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/6682#issuecomment-109871137 @yhuai, yes, the full outer join cases shuffle the null keys to the same reducer in spark-sql, while the Hive plan is generated like this:

```sql
explain select a.value,b.value,c.value,d.value from a full outer join b on a.key = b.key full outer join c on a.key = c.key full outer join d on a.key = d.key

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
            Reduce Output Operator
              key expressions: key (type: string)
              sort order: +
              Map-reduce partition columns: key (type: string)
              Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
              value expressions: value (type: string)
          TableScan
            alias: b
            Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
            Reduce Output Operator
              key expressions: key (type: string)
              sort order: +
              Map-reduce partition columns: key (type: string)
              Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
              value expressions: value (type: string)
          TableScan
            alias: c
            Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
            Reduce Output Operator
              key expressions: key (type: string)
              sort order: +
              Map-reduce partition columns: key (type: string)
              Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
              value expressions: value (type: string)
          TableScan
            alias: d
            Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
            Reduce Output Operator
              key expressions: key (type: string)
              sort order: +
              Map-reduce partition columns: key (type: string)
              Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
              value expressions: value (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
               Outer Join 0 to 2
               Outer Join 0 to 3
          keys:
            0 key (type: string)
            1 key (type: string)
            2 key (type: string)
            3 key (type: string)
          outputColumnNames: _col1, _col6, _col11, _col16
          Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
          Select Operator
            expressions: _col1 (type: string), _col6 (type: string), _col11 (type: string), _col16 (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3
            Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
```

@chenghao-intel has a solution in #6413
[GitHub] spark pull request: [SPARK-2205][SPARK-7871][SQL]Advoid redundancy...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/6682#issuecomment-109777121 @yhuai, thanks for the comment. The current implementation of `join(BinaryNode)` in master simply uses one side's partitioning as its own partitioning to decide whether a shuffle is needed, and ignores the other side's partitioning even when that side is already partitioned. This can cause unnecessary shuffles in multiway joins. For example:

```sql
table a(key string, value string)
table b(key string, value string)
table c(key string, value string)
table d(key string, value string)
table e(key string, value string)

select a.value, b.value, c.value, d.value, e.value
from a
join b on a.key = b.key
join c on a.key = c.key
join d on b.key = d.key
join e on c.key = e.key
```

we get:

```
Project [value#63,value#65,value#67,value#69,value#71]
 ShuffledHashJoin [key#66], [key#70], BuildRight
  Exchange (HashPartitioning [key#66], 200)
   Project [value#63,key#66,value#67,value#65,value#69]
    ShuffledHashJoin [key#64], [key#68], BuildRight
     Exchange (HashPartitioning [key#64], 200)
      Project [value#63,key#66,key#64,value#67,value#65]
       ShuffledHashJoin [key#62], [key#66], BuildRight
        ShuffledHashJoin [key#62], [key#64], BuildRight
         Exchange (HashPartitioning [key#62], 200)
          HiveTableScan [key#62,value#63], (MetastoreRelation default, a, None), None
         Exchange (HashPartitioning [key#64], 200)
          HiveTableScan [key#64,value#65], (MetastoreRelation default, b, None), None
        Exchange (HashPartitioning [key#66], 200)
         HiveTableScan [key#66,value#67], (MetastoreRelation default, c, None), None
       Exchange (HashPartitioning [key#68], 200)
        HiveTableScan [key#68,value#69], (MetastoreRelation default, d, None),
```

But actually we just need:

```
Project [value#59,value#61,value#63,value#65,value#67]
 ShuffledHashJoin [key#62], [key#66], BuildRight
  Project [value#63,value#61,value#65,value#59,key#62]
   ShuffledHashJoin [key#60], [key#64], BuildRight
    Project [value#63,value#61,key#60,value#59,key#62]
     ShuffledHashJoin [key#58], [key#62], BuildRight
      ShuffledHashJoin [key#58], [key#60], BuildRight
       Exchange (HashPartitioning 200)
        HiveTableScan [key#58,value#59], (MetastoreRelation default, a, None), None
       Exchange (HashPartitioning 200)
        HiveTableScan [key#60,value#61], (MetastoreRelation default, b, None), None
      Exchange (HashPartitioning 200)
       HiveTableScan [key#62,value#63], (MetastoreRelation default, c, None), None
    Exchange (HashPartitioning 200)
     HiveTableScan [key#64,value#65], (MetastoreRelation default, d, None), None
   Exchange (HashPartitioning 200)
    HiveTableScan [key#66,value#67], (MetastoreRelation default, e, None), None
```

This will greatly improve efficiency, especially for outer joins. We had some real-world cases of multiway full outer joins on the same key: they produced a lot of null keys (causing data skew, while Hive doesn't) with redundant shuffles and finally ran out of memory. I want to try using `meetPartitions`: we can save both sides' `outputPartitioning` of the `BinaryNode` together with its own `outputPartitioning` (redundant?) when constructing the plan tree to achieve this easily, and `meetPartitions` will be reset to the node's `outputPartitioning` when a shuffle is needed, to avoid removing an `Exchange` that is actually required.
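The core idea — tracking the partitionings already "met" along each side of a `BinaryNode`, and only inserting an exchange when none of them satisfies the required keys — can be sketched with toy types (my own illustration, not Catalyst's actual API):

```scala
// A partitioning is just the set of keys the data is hash-distributed on.
case class Partitioning(keys: Set[String])

// Each plan node carries its own output partitioning plus every partitioning
// inherited from descendants that is still valid (no shuffle since then).
case class PlanNode(partitioning: Partitioning, meetPartitions: Set[Partitioning])

// An Exchange is needed only if neither the node's own partitioning nor any
// previously met partitioning already distributes data on the join keys.
def needsExchange(child: PlanNode, joinKeys: Set[String]): Boolean =
  !(child.meetPartitions + child.partitioning).exists(_.keys == joinKeys)
```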
[GitHub] spark pull request: [SPARK-2205][SPARK-7871][SQL]Advoid redundancy...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/6682#issuecomment-109564743 cc @yhuai @chenghao-intel
[GitHub] spark pull request: [SPARK-2205][SPARK-7871][SQL]Advoid redundancy...
GitHub user jeanlyn opened a pull request: https://github.com/apache/spark/pull/6682 [SPARK-2205][SPARK-7871][SQL]Advoid redundancy exchange Using only the output partitioning of a `BinaryNode` will probably add unnecessary `Exchange` operators, for example in multiway joins. This PR adds `meetPartitions` to SparkPlan to avoid redundant exchanges: it saves the partitionings of the node and its children, and is reset to the node's own partitioning when a shuffle is needed. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jeanlyn/spark addMeetRequire Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/6682.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #6682 commit 7e52db3e6ff6e2dffe0c599730b8019319fda185 Author: jeanlyn Date: 2015-06-06T10:22:57Z remove unnecessary exchange
[GitHub] spark pull request: [SPARK-8020] Spark SQL in spark-defaults.conf ...
Github user jeanlyn commented on a diff in the pull request: https://github.com/apache/spark/pull/6563#discussion_r31490129

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala ---

@@ -37,6 +38,48 @@ class VersionsSuite extends SparkFunSuite with Logging {
       "hive.metastore.warehouse.dir" -> warehousePath.toString)
   }

+  test("SPARK-8020: successfully create a HiveContext with metastore settings in Spark conf.") {
+    val sparkConf =
+      new SparkConf() {
+        // We are not really clone it. We need to keep the custom getAll.
+        override def clone: SparkConf = this
+
+        override def getAll: Array[(String, String)] = {
+          val allSettings = super.getAll
+          val metastoreVersion = get("spark.sql.hive.metastore.version")
+          val metastoreJars = get("spark.sql.hive.metastore.jars")
+
+          val others = allSettings.filterNot { case (key, _) =>
+            key == "spark.sql.hive.metastore.version" || key == "spark.sql.hive.metastore.jars"
+          }
+
+          // Put metastore.version to the first one. It is needed to trigger the exception
+          // caused by SPARK-8020. Other problems triggered by SPARK-8020
+          // (e.g. using Hive 0.13.1's metastore client to connect to the a 0.12 metastore)
+          // are not easy to test.
+          Array(
+            ("spark.sql.hive.metastore.version" -> metastoreVersion),
+            ("spark.sql.hive.metastore.jars" -> metastoreJars)) ++ others
+        }
+      }
+    sparkConf
+      .set("spark.sql.hive.metastore.version", "12")

--- End diff --

Does `12` equate to `0.12.0` or `1.2`?
[GitHub] spark pull request: [SPARK-6908] [SQL] Use isolated Hive client
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/5876#issuecomment-107459171 I can use the built-in classes (0.13.1) to connect to a `0.12.0` metastore correctly, except for some warnings and errors that do not affect running:

```
15/06/01 21:20:09 WARN metastore.RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect.
org.apache.thrift.TApplicationException: Invalid method name: 'get_functions'
	at org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
	at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_functions(ThriftHiveMetastore.java:2886)
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_functions(ThriftHiveMetastore.java:2872)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getFunctions(HiveMetaStoreClient.java:1727)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
	at $Proxy10.getFunctions(Unknown Source)
	at org.apache.hadoop.hive.ql.metadata.Hive.getFunctions(Hive.java:2670)
	at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionNames(FunctionRegistry.java:674)
	at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionNames(FunctionRegistry.java:662)
	at org.apache.hadoop.hive.cli.CliDriver.getCommandCompletor(CliDriver.java:540)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:174)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/06/01 21:20:10 INFO hive.metastore: Trying to connect to metastore with URI thrift://172.19.154.28:9084
15/06/01 21:20:10 INFO hive.metastore: Connected to metastore.
15/06/01 21:20:10 ERROR exec.FunctionRegistry: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: Invalid method name: 'get_functions'
spark-sql>
```

Thanks
[GitHub] spark pull request: [SPARK-6908] [SQL] Use isolated Hive client
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/5876#issuecomment-107415482 I set the metastore to `0.12.0` by the following steps, but get a NoClassDefFoundError:

* I changed `spark.sql.hive.metastore.version` in `spark-defaults.conf` to `0.12.0`
* I set `spark.sql.hive.metastore.jars` to `maven`, or to the classpath of Hadoop and Hive, and got

```
java.lang.NoClassDefFoundError: com/google/common/base/Preconditions when creating Hive client using classpath
```

I am not sure I understand it correctly.
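For reference, the configuration being described would look roughly like this in `spark-defaults.conf` (keys and values taken from this thread; the jars value is whichever of `maven` or a local classpath is being tried):

```
spark.sql.hive.metastore.version   0.12.0
spark.sql.hive.metastore.jars      maven
```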
[GitHub] spark pull request: [SPARK-7885][SQL]add config to control map agg...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/6426#issuecomment-107059001 Thanks @rxin for the information. I will close this PR for now and reopen it once it has been optimized.
[GitHub] spark pull request: [SPARK-7885][SQL]add config to control map agg...
Github user jeanlyn closed the pull request at: https://github.com/apache/spark/pull/6426
[GitHub] spark pull request: [SPARK-7871] [SQL] Improve the outputPartition...
Github user jeanlyn commented on a diff in the pull request: https://github.com/apache/spark/pull/6413#discussion_r31158220

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala ---

@@ -32,6 +32,26 @@ import org.apache.spark.sql.types._

 class PlannerSuite extends FunSuite {

+  test("multiway full outer join") {
+    val planned = testData
+      .join(testData2, testData("key") === testData2("a"), "outer")
+      .join(testData3, testData("key") === testData3("a"), "outer")
+      .queryExecution.executedPlan
+    val exchanges = planned.collect { case n: Exchange => n }
+
+    assert(exchanges.size === 3)

--- End diff --

Don't these changes affect

```scala
testData
  .join(testData2, testData("key") === testData2("a"), "outer")
  .join(testData2, testData("a") === testData3("a"), "outer")
```

?
[GitHub] spark pull request: [SPARK-7885][SQL]add config to control map agg...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/6426#issuecomment-105759184 Thanks @JoshRosen @rxin for the comments. I have met at least two group-by cases in our production environment that ran with long GC times and finally crashed the executors. These cases share a common characteristic: the input split is large (more than 128MB) and the number of keys is also large, but there is no data skew, because when I turn `spark.sql.partialAggregation.enable` off we get the result quickly, and each reduce task handles almost the same data size. @rxin, do you have more information about the `refactor aggregate/UDAF interface` work? Thanks!
[GitHub] spark pull request: [SPARK-7885][SQL]add config to control map agg...
GitHub user jeanlyn opened a pull request: https://github.com/apache/spark/pull/6426 [SPARK-7885][SQL]add config to control map aggregation in spark sql [SPARK-7885](https://issues.apache.org/jira/browse/SPARK-7885): we add `spark.sql.partialAggregation.enable`; it is true by default, and we can set it to false to disable map-side aggregation and avoid GC problems. For example, we run this SQL:

```sql
insert overwrite table groupbytest
select
  sale_ord_id as order_id,
  coalesce(sum(sku_offer_amount),0.0) as sku_offer_amount,
  coalesce(sum(suit_offer_amount),0.0) as suit_offer_amount,
  coalesce(sum(flash_gp_offer_amount),0.0) + coalesce(sum(gp_offer_amount),0.0) as gp_offer_amount,
  coalesce(sum(flash_gp_offer_amount),0.0) as flash_gp_offer_amount,
  coalesce(sum(full_minus_offer_amount),0.0) as full_rebate_offer_amount,
  0.0 as telecom_point_offer_amount,
  coalesce(sum(coupon_pay_amount),0.0) as dq_and_jq_pay_amount,
  coalesce(sum(jq_pay_amount),0.0) + coalesce(sum(pop_shop_jq_pay_amount),0.0) + coalesce(sum(lim_cate_jq_pay_amount),0.0) as jq_pay_amount,
  coalesce(sum(dq_pay_amount),0.0) + coalesce(sum(pop_shop_dq_pay_amount),0.0) + coalesce(sum(lim_cate_dq_pay_amount),0.0) as dq_pay_amount,
  coalesce(sum(gift_cps_pay_amount),0.0) as gift_cps_pay_amount,
  coalesce(sum(mobile_red_packet_pay_amount),0.0) as mobile_red_packet_pay_amount,
  coalesce(sum(acct_bal_pay_amount),0.0) as acct_bal_pay_amount,
  coalesce(sum(jbean_pay_amount),0.0) as jbean_pay_amount,
  coalesce(sum(sku_rebate_amount),0.0) as sku_rebate_amount,
  coalesce(sum(yixun_point_pay_amount),0.0) as yixun_point_pay_amount,
  coalesce(sum(sku_freight_coupon_amount),0.0) as freight_coupon_amount
from ord_at_det_di
where ds = '2015-05-20'
group by sale_ord_id
```

Using 6 executors, each with 8GB memory and 2 CPUs, we got GC problems during the map-side aggregation and finally the executors crashed: ![5869030a-d924-4249-9e1d-c637caa9363a](https://cloud.githubusercontent.com/assets/3426093/7828153/4afdaf88-0462-11e5-8af0-3bff04edab92.png) When we set `spark.sql.partialAggregation.enable` to false, the SQL ran in 2 minutes. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jeanlyn/spark partialAggregation Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/6426.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #6426 commit b17c676bb0d33019bbdd124048221595f278b9d0 Author: jeanlyn Date: 2015-05-27T03:03:47Z add config to control map aggregation in spark sql
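What the proposed flag toggles can be illustrated with a toy map-side combine (my own sketch; `spark.sql.partialAggregation.enable` is the config name proposed in this PR, everything else is illustrative):

```scala
import scala.collection.mutable

// With partial aggregation on, every map task pre-combines values per key in a
// hash map before the shuffle. When keys are mostly unique (as in this
// group-by on sale_ord_id), the map grows to roughly the whole input, which is
// exactly the GC pressure described above; disabling it shuffles rows raw and
// combines only on the reduce side.
def mapSideCombine(rows: Iterator[(String, Double)]): Iterator[(String, Double)] = {
  val buf = mutable.HashMap.empty[String, Double]
  rows.foreach { case (k, v) => buf(k) = buf.getOrElse(k, 0.0) + v }
  buf.iterator
}
```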
[GitHub] spark pull request: [SPARK-2926][Shuffle]Add MR style sort-merge s...
Github user jeanlyn commented on a diff in the pull request: https://github.com/apache/spark/pull/3438#discussion_r30295775

--- Diff: core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleReader.scala ---

@@ -0,0 +1,337 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.shuffle.sort
+
+import java.io.File
+import java.io.FileOutputStream
+import java.nio.ByteBuffer
+import java.util.Comparator
+
+import scala.collection.mutable.{ArrayBuffer, HashMap, Queue}
+import scala.util.{Failure, Success, Try}
+
+import org.apache.spark._
+import org.apache.spark.executor.ShuffleWriteMetrics
+import org.apache.spark.network.buffer.{ManagedBuffer, NioManagedBuffer}
+import org.apache.spark.serializer.Serializer
+import org.apache.spark.shuffle.{BaseShuffleHandle, FetchFailedException, ShuffleReader}
+import org.apache.spark.storage._
+import org.apache.spark.util.{CompletionIterator, Utils}
+import org.apache.spark.util.collection.{MergeUtil, TieredDiskMerger}
+
+/**
+ * SortShuffleReader merges and aggregates shuffle data that has already been sorted within each
+ * map output block.
+ *
+ * As blocks are fetched, we store them in memory until we fail to acquire space from the
+ * ShuffleMemoryManager. When this occurs, we merge some in-memory blocks to disk and go back to
+ * fetching.
+ *
+ * TieredDiskMerger is responsible for managing the merged on-disk blocks and for supplying an
+ * iterator with their merged contents. The final iterator that is passed to user code merges this
+ * on-disk iterator with the in-memory blocks that have not yet been spilled.
+ */
+private[spark] class SortShuffleReader[K, C](
+    handle: BaseShuffleHandle[K, _, C],
+    startPartition: Int,
+    endPartition: Int,
+    context: TaskContext)
+  extends ShuffleReader[K, C] with Logging {
+
+  /** Manage the fetched in-memory shuffle block and related buffer */
+  case class MemoryShuffleBlock(blockId: BlockId, blockData: ManagedBuffer)
+
+  require(endPartition == startPartition + 1,
+    "Sort shuffle currently only supports fetching one partition")
+
+  private val dep = handle.dependency
+  private val conf = SparkEnv.get.conf
+  private val blockManager = SparkEnv.get.blockManager
+  private val ser = Serializer.getSerializer(dep.serializer)
+  private val shuffleMemoryManager = SparkEnv.get.shuffleMemoryManager
+
+  private val fileBufferSize = conf.getInt("spark.shuffle.file.buffer.kb", 32) * 1024
+
+  /** Queue to store in-memory shuffle blocks */
+  private val inMemoryBlocks = new Queue[MemoryShuffleBlock]()
+
+  /**
+   * Maintain block manager and reported size of each shuffle block. The block manager is used for
+   * error reporting. The reported size, which, because of size compression, may be slightly
+   * different than the size of the actual fetched block, is used for calculating how many blocks
+   * to spill.
+   */
+  private val shuffleBlockMap = new HashMap[ShuffleBlockId, (BlockManagerId, Long)]()
+
+  /** keyComparator for mergeSort, if keyOrdering is not available,
+   *  using hashcode of key to compare */
+  private val keyComparator: Comparator[K] = dep.keyOrdering.getOrElse(new Comparator[K] {
+    override def compare(a: K, b: K) = {
+      val h1 = if (a == null) 0 else a.hashCode()
+      val h2 = if (b == null) 0 else b.hashCode()
+      if (h1 < h2) -1 else if (h1 == h2) 0 else 1
+    }
+  })
+
+  /** A merge thread to merge on-disk blocks */
+  private val tieredMerger = new TieredDiskMerger(conf, dep, keyComparator, context)
+
+  /** Shuffle block fetcher iterator */
+  private var shuffleRawBlockFetcherItr: ShuffleRawBlockFetcherIterator = _
+
+  /** Number of bytes spilled in memory and on disk */
+  private var _memoryBytesSpilled: Long = 0L
[GitHub] spark pull request: [SQL][Minor] typo
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/5220#issuecomment-86787963 @chenghao-intel, I think #5198 has fixed the problem.
[GitHub] spark pull request: [SPARK-5794] [SQL] [WIP] fix add jar
Github user jeanlyn commented on a diff in the pull request: https://github.com/apache/spark/pull/4586#discussion_r26954661

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---

@@ -76,7 +77,8 @@ class HadoopTableReader(

   override def makeRDDForTable(hiveTable: HiveTable): RDD[Row] =
     makeRDDForTable(
       hiveTable,
-      relation.tableDesc.getDeserializerClass.asInstanceOf[Class[Deserializer]],
+      Class.forName(relation.tableDesc.getSerdeClassName, true, sc.sparkContext.getClassLoader)

--- End diff --

Shall we also do this in `makeRDDForPartitionedTable`?
[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...
Github user jeanlyn closed the pull request at: https://github.com/apache/spark/pull/5079
[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/5079#issuecomment-85085332 After communicating with @adrian-wang offline, I realized this PR still leaves some class loader problems, so I am closing it.
[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/5079#issuecomment-83383402 I also don't have CHAR in `mapjoin_addjar.q`. I only found one `mapjoin_addjar.q`, and the path of my file is sql/hive/src/test/resources/ql/src/test/queries/clientpositive/mapjoin_addjar.q:

```sql
set hive.auto.convert.join=true;
set hive.auto.convert.join.use.nonstaged=false;

add jar ${system:maven.local.repository}/org/apache/hive/hcatalog/hive-hcatalog-core/${system:hive.version}/hive-hcatalog-core-${system:hive.version}.jar;

CREATE TABLE t1 (a string, b string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';

LOAD DATA LOCAL INPATH "../../data/files/sample.json" INTO TABLE t1;

select * from src join t1 on src.key = t1.a;

drop table t1;
set hive.auto.convert.join=false;
```

Maybe we can discuss this offline?
[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/5079#issuecomment-83372419 @chenghao-intel, my full code is:

```java
import org.apache.hadoop.hive.ql.exec.UDF;

public class hello extends UDF {
    public String evaluate(String str) {
        try {
            return "hello " + str;
        } catch (Exception e) {
            return null;
        }
    }
}
```

@adrian-wang, I also tested `mapjoin_addjar.q` in `spark-sql` and got this exception on `CREATE TABLE`:

```
15/03/19 14:41:36 ERROR DDLTask: java.lang.NoSuchFieldError: CHAR
	at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:310)
	at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:277)
	at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
	at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
```

But it does not seem to be the jar loading problem, because when I do not run

```
add jar ${system:maven.local.repository}/org/apache/hive/hcatalog/hive-hcatalog-core/${system:hive.version}/hive-hcatalog-core-${system:hive.version}.jar;
```

I get the following exception:

```
15/03/19 14:54:51 ERROR DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: Cannot validate serde: org.apache.hive.hcatalog.data.JsonSerDe
	at org.apache.hadoop.hive.ql.exec.DDLTask.validateSerDe(DDLTask.java:3423)
	at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3553)
	at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:252)
```
[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/5079#issuecomment-83348085 @adrian-wang, I have tested in `spark-sql` and get the correct result with my test case. Can you provide your test case? By the way, when I debugged this issue I found that `thrift-server` mode also reuses `SparkContext.addJar`.
[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/5079#issuecomment-83299775 @adrian-wang, you mean it does not work in `spark-shell`?
[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/5079#issuecomment-83267913 @chenghao-intel, I am not clear on what problem #4586 tries to fix. If #4586 tries to fix the problem as I mentioned, I think reusing `SparkContext.addJar` is enough to fix the class loading/unloading problem, right?
[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/5079#issuecomment-83254407 @yhuai, it is a simple function:

```java
public String evaluate(String str) {
    try {
        return "hello " + str;
    } catch (Exception e) {
        return null;
    }
}
```
[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/5079#issuecomment-82976538 Ok.
[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/5079#issuecomment-82970359 Thanks @liancheng for the explanation. You are right, it needs more consideration. So, should I remove the test?
[GitHub] spark pull request: [SPARK-5498][SQL]fix query exception when part...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/4289#issuecomment-82926819 Hi @marmbrus, I have updated the code as you mentioned.
[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/5079#issuecomment-82908034 Updated. @liancheng @marmbrus, I have tried to add a test for this patch; could you take a look at it?
[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...
GitHub user jeanlyn opened a pull request: https://github.com/apache/spark/pull/5079 [SPARK-6392][SQL]Minor fix ClassNotFound exception when use spark cli to add jar When we use the Spark CLI to add a jar dynamically, we get a `java.lang.ClassNotFoundException` when we use a class from the jar to create a UDF. For example:

```sql
spark-sql> add jar /home/jeanlyn/hello.jar;
spark-sql> create temporary function hello as 'hello';
spark-sql> select hello(name) from person;
Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure:
Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.ClassNotFoundException: hello
```

We can use the Spark physical plan to fix this problem. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jeanlyn/spark SPARK-6392 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5079.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5079 commit ca958493957e7cc5f49dc318bcbf213793a991c0 Author: jeanlyn Date: 2015-03-18T02:02:35Z use spark physical plan when add jar
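The failure mode is the usual classloader one: the added jar must be visible to the loader that later resolves the UDF class. A rough standalone illustration (hypothetical helper, not Spark's actual CLI code):

```scala
import java.net.{URL, URLClassLoader}

// Load a UDF class from a dynamically added jar. Class.forName succeeds only
// if the jar's URL was registered on a loader reachable from this one; when
// "add jar" never reaches the loader used on the executors, the result is
// precisely the ClassNotFoundException shown above.
def loadUdfClass(jarPath: String, className: String): Class[_] = {
  val loader = new URLClassLoader(
    Array(new URL(s"file:$jarPath")),
    Thread.currentThread().getContextClassLoader)
  Class.forName(className, true, loader)
}
```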
[GitHub] spark pull request: [SPARK-5794] [SQL] [WIP] fix add jar
Github user jeanlyn commented on a diff in the pull request: https://github.com/apache/spark/pull/4586#discussion_r26595976 --- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala --- @@ -263,7 +263,8 @@ private[hive] class SparkSQLCLIDriver extends CliDriver with Logging { val proc: CommandProcessor = HiveShim.getCommandProcessor(Array(tokens(0)), hconf) if (proc != null) { -if (proc.isInstanceOf[Driver] || proc.isInstanceOf[SetProcessor]) { +if (proc.isInstanceOf[Driver] || proc.isInstanceOf[SetProcessor] || + proc.isInstanceOf[AddResourceProcessor]) { --- End diff -- agree --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5498][SQL]fix query exception when part...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/4289#issuecomment-82383269 Updated, @marmbrus @chenghao-intel . We had tested this patch in our environment over the past few days.Any more problems in this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5068][SQL]fix bug query data when path ...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/3907#issuecomment-75260761 OK.I close this one --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5068][SQL]fix bug query data when path ...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/3891#issuecomment-75260856 OK.I close this one --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5068][SQL]fix bug query data when path ...
Github user jeanlyn closed the pull request at: https://github.com/apache/spark/pull/3907 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5068][SQL]fix bug query data when path ...
Github user jeanlyn closed the pull request at: https://github.com/apache/spark/pull/3891 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5498][SQL]fix query exception when part...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/4289#issuecomment-73877412 /cc @marmbrus --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5498][SQL]fix query exception when part...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/4289#issuecomment-73868381 @chenghao-intel, all the unit tests pass on my local machine. However, I think the thrift-server unit tests are unstable: they depend on the state of the machine, and when the machine is busy they may time out.
[GitHub] spark pull request: [SPARK-5498][SQL]fix query exception when part...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/4289#issuecomment-73837462 Hi @marmbrus, @chenghao-intel, I have no idea why the `SPARK-4407 regression: Complex type support` test failed after I resolved the merge conflicts. It does not seem to be caused by my changes, because this unit test passed before.
[GitHub] spark pull request: [SPARK-5498][SQL]fix query exception when part...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/4289#issuecomment-73836885 Retest this please
[GitHub] spark pull request: [SPARK-5498][SQL]fix bug when query the data w...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/4289#issuecomment-73459718 Thanks @chenghao-intel for the review and suggestions! I took some of your advice to simplify the code.
[GitHub] spark pull request: [SPARK-5498][SQL]fix bug when query the data w...
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/4289#issuecomment-73176093 hi,@chenghao-intel @marmbrus any suggestions? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5498][SQL]fix bug when query the data w...
Github user jeanlyn commented on a diff in the pull request: https://github.com/apache/spark/pull/4289#discussion_r24139046

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -264,15 +268,31 @@ private[hive] object HadoopTableReader extends HiveInspectors {
    * @param nonPartitionKeyAttrs Attributes that should be filled together with their corresponding
    *                             positions in the output schema
    * @param mutableRow A reusable `MutableRow` that should be filled
+   * @param convertdeserializer The `Deserializer` used to convert the output of `deserializer`
    * @return An `Iterator[Row]` transformed from `iterator`
    */
   def fillObject(
       iterator: Iterator[Writable],
       deserializer: Deserializer,
       nonPartitionKeyAttrs: Seq[(Attribute, Int)],
-      mutableRow: MutableRow): Iterator[Row] = {
+      mutableRow: MutableRow,
+      convertdeserializer: Option[Deserializer] = None): Iterator[Row] = {
--- End diff --

But `val soi` also needs a converted deserializer when the schemas don't match.
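For readers following this thread, here is a minimal sketch of the conversion idea under discussion (not the actual patch; the helper name `makeRowConverter` is hypothetical). It assumes Hive's `ObjectInspectorConverters` API: a converter is built from the partition-level deserializer's inspector to the table-level one, which is the role the proposed `convertdeserializer` parameter would play.

    import org.apache.hadoop.hive.serde2.Deserializer
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters

    // Hypothetical helper: build a function that converts a raw deserialized row
    // from the partition's schema to the table's schema. `partitionDeser` is the
    // deserializer actually used to read the data; `tableDeser` plays the role
    // of the proposed `convertdeserializer`.
    def makeRowConverter(
        partitionDeser: Deserializer,
        tableDeser: Deserializer): AnyRef => AnyRef = {
      val converter = ObjectInspectorConverters.getConverter(
        partitionDeser.getObjectInspector, // source: partition-level inspector
        tableDeser.getObjectInspector)     // target: table-level inspector (what `soi` should reflect)
      raw => converter.convert(raw)
    }

With such a converter in place, `fillObject` could keep extracting fields through the table-level `StructObjectInspector` even when the partition was written with a different serde.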