[GitHub] spark issue #11228: [SPARK-13356][Streaming]WebUI missing input informations...

2016-11-07 Thread jeanlyn
Github user jeanlyn commented on the issue:

https://github.com/apache/spark/pull/11228
  
@tdas I have added a unit test to this PR; could you take some time to review it?





[GitHub] spark issue #11228: [SPARK-13356][Streaming]WebUI missing input informations...

2016-10-26 Thread jeanlyn
Github user jeanlyn commented on the issue:

https://github.com/apache/spark/pull/11228
  
@tdas OK, I will try to add a unit test in the next few days.





[GitHub] spark issue #11228: [SPARK-13356][Streaming]WebUI missing input informations...

2016-08-07 Thread jeanlyn
Github user jeanlyn commented on the issue:

https://github.com/apache/spark/pull/11228
  
@vanzin Sorry for the late reply; I have resolved the conflicts.





[GitHub] spark pull request: [SPARK-14243][CORE][BACKPORT-1.6]update task m...

2016-04-05 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/12150#issuecomment-206053030
  
OK.





[GitHub] spark pull request: [SPARK-14243][CORE][BACKPORT-1.6]update task m...

2016-04-05 Thread jeanlyn
Github user jeanlyn closed the pull request at:

https://github.com/apache/spark/pull/12150





[GitHub] spark pull request: [SPARK-14243][CORE][BACKPORT-1.6]update task m...

2016-04-04 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/12150#issuecomment-205341839
  
retest this please.





[GitHub] spark pull request: [SPARK-14243][CORE]update task metrics when re...

2016-04-04 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/12091#issuecomment-205334309
  
@andrewor14 Sorry for the late reply; I have submitted a patch for 1.6: #12150.





[GitHub] spark pull request: [SPARK-14243][CORE][BACKPORT-1.6]update task m...

2016-04-04 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/12150#issuecomment-205334579
  
/cc @andrewor14 .





[GitHub] spark pull request: [SPARK-14243][CORE][BACKPORT-1.6]update task m...

2016-04-04 Thread jeanlyn
GitHub user jeanlyn opened a pull request:

https://github.com/apache/spark/pull/12150

 [SPARK-14243][CORE][BACKPORT-1.6]update task metrics when removing blocks

## What changes were proposed in this pull request?

This patch updates `updatedBlockStatuses` when removing blocks, making sure `BlockManager` correctly maintains `updatedBlockStatuses`.


## How was this patch tested?

test("updated block statuses") in BlockManagerSuite.scala


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jeanlyn/spark updataBlock1.6

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12150.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12150


commit ff151af2d2495f99c3c22a3f17c78f5ff4fd9402
Author: jeanlyn <jeanly...@gmail.com>
Date:   2016-04-04T14:35:14Z

add metrics when removing blocks







[GitHub] spark pull request: [SPARK-14243][CORE]update task metrics when re...

2016-03-31 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/12091#issuecomment-204006712
  
cc @andrewor14 





[GitHub] spark pull request: update task metrics when removing blocks

2016-03-31 Thread jeanlyn
GitHub user jeanlyn opened a pull request:

https://github.com/apache/spark/pull/12091

update task metrics when removing blocks

## What changes were proposed in this pull request?

This PR uses `incUpdatedBlockStatuses` to update `updatedBlockStatuses` when removing blocks, making sure `BlockManager` correctly maintains `updatedBlockStatuses`.
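
For illustration only, the kind of change described here could look roughly like the sketch below; the `TaskContext.get()` lookup and the use of `BlockStatus.empty` are assumptions made for the sketch, not necessarily what the patch itself does:

```scala
import org.apache.spark.TaskContext
import org.apache.spark.storage.{BlockId, BlockStatus}

// Sketch only: when a block is removed, record an empty status in the current
// task's metrics so listeners that consume updatedBlockStatuses see the removal.
def recordBlockRemoval(blockId: BlockId): Unit = {
  Option(TaskContext.get()).foreach { tc =>
    tc.taskMetrics().incUpdatedBlockStatuses(Seq((blockId, BlockStatus.empty)))
  }
}
```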


## How was this patch tested?

test("updated block statuses") in BlockManagerSuite.scala

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jeanlyn/spark updateBlock

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12091.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12091


commit a97d220a96ed1f511f617f0f98663b82f2fd26ed
Author: jeanlyn <jeanly...@gmail.com>
Date:   2016-03-30T16:01:20Z

update metrics when removing blocks







[GitHub] spark pull request: [SPARK-13845][CORE][Backport-1.6]Using onBlock...

2016-03-29 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/12028#issuecomment-203173603
  
OK.





[GitHub] spark pull request: [SPARK-13845][CORE][Backport-1.6]Using onBlock...

2016-03-29 Thread jeanlyn
Github user jeanlyn closed the pull request at:

https://github.com/apache/spark/pull/12028





[GitHub] spark pull request: [SPARK-13845][CORE][Backport-1.6]Using onBlock...

2016-03-29 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/12028#issuecomment-202777494
  
/cc @andrewor14 





[GitHub] spark pull request: [SPARK-13845][CORE][Backport-1.6]Using onBlock...

2016-03-29 Thread jeanlyn
GitHub user jeanlyn opened a pull request:

https://github.com/apache/spark/pull/12028

[SPARK-13845][CORE][Backport-1.6]Using onBlockUpdated to replace onTaskEnd avoiding driver OOM

## What changes were proposed in this pull request?

We have a streaming job using `FlumePollInputStream` whose driver always OOMs after a few days; here is part of a driver heap dump taken before the OOM:
```
 num #instances #bytes  class name
--
   1:  13845916  553836640  org.apache.spark.storage.BlockStatus
   2:  14020324  336487776  org.apache.spark.storage.StreamBlockId
   3:  13883881  333213144  scala.collection.mutable.DefaultEntry
   4:  8907   89043952  [Lscala.collection.mutable.HashEntry;
   5: 62360   65107352  [B
   6:163368   24453904  [Ljava.lang.Object;
   7:293651   20342664  [C
...
```
`BlockStatus` and `StreamBlockId` keep growing, and the driver eventually OOMs.
After investigating, I found that `executorIdToStorageStatus` in `StorageStatusListener` seemingly never removes blocks from `StorageStatus`.
To fix the issue, I use `onBlockUpdated` to replace `onTaskEnd`, so we can update block information (adding blocks, dropping blocks from memory to disk, and deleting blocks) in time.
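
As a rough, hedged sketch of the listener side (not the exact patch), reacting to `onBlockUpdated` instead of `onTaskEnd` could look like the following; the simplified per-executor block set stands in for the real `StorageStatus` bookkeeping, and using `storageLevel.isValid` to distinguish updates from removals is an assumption:

```scala
import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockUpdated}
import org.apache.spark.storage.BlockId

// Sketch only: track blocks per executor from block-update events, so blocks
// removed on executors are also forgotten on the driver instead of piling up.
class BlockUpdateTrackingListener extends SparkListener {
  private val executorIdToBlocks = mutable.Map.empty[String, mutable.Set[BlockId]]

  override def onBlockUpdated(event: SparkListenerBlockUpdated): Unit = {
    val info = event.blockUpdatedInfo
    val blocks = executorIdToBlocks.getOrElseUpdate(
      info.blockManagerId.executorId, mutable.Set.empty[BlockId])
    if (info.storageLevel.isValid) {
      blocks += info.blockId // block added, or moved between memory and disk
    } else {
      blocks -= info.blockId // an invalid storage level means the block was removed
    }
  }
}
```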


## How was this patch tested?

Existing unit tests and manual tests




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jeanlyn/spark fixoom1.6

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12028.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12028


commit f37f3594b71df0083880aa5395a77808fd6c72b3
Author: jeanlyn <jeanly...@gmail.com>
Date:   2016-03-29T05:06:03Z

Using onBlockUpdated to replace onTaskEnd avioding driver OOM







[GitHub] spark pull request: [SPARK-13845][CORE]Using onBlockUpdated to rep...

2016-03-28 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/11679#issuecomment-202653414
  
OK, I will file a JIRA later.





[GitHub] spark pull request: [SPARK-13845][CORE]Using onBlockUpdated to rep...

2016-03-28 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/11779#issuecomment-202649212
  
Sure. I will submit a patch against branch-1.6





[GitHub] spark pull request: [SPARK-13845][CORE]Using onBlockUpdated to rep...

2016-03-19 Thread jeanlyn
GitHub user jeanlyn opened a pull request:

https://github.com/apache/spark/pull/11779

[SPARK-13845][CORE]Using onBlockUpdated to replace onTaskEnd avoiding driver OOM

## What changes were proposed in this pull request?

We have a streaming job using `FlumePollInputStream` whose driver always OOMs after a few days; here is part of a driver heap dump taken before the OOM:
```
 num #instances #bytes  class name
--
   1:  13845916  553836640  org.apache.spark.storage.BlockStatus
   2:  14020324  336487776  org.apache.spark.storage.StreamBlockId
   3:  13883881  333213144  scala.collection.mutable.DefaultEntry
   4:  8907   89043952  [Lscala.collection.mutable.HashEntry;
   5: 62360   65107352  [B
   6:163368   24453904  [Ljava.lang.Object;
   7:293651   20342664  [C
...
```
`BlockStatus` and `StreamBlockId` keep growing, and the driver eventually OOMs.
After investigating, I found that `executorIdToStorageStatus` in `StorageStatusListener` seemingly never removes blocks from `StorageStatus`.
To fix the issue, I use `onBlockUpdated` to replace `onTaskEnd`, so we can update block information (adding blocks, dropping blocks from memory to disk, and deleting blocks) in time.


## How was this patch tested?

Existing unit tests and manual tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jeanlyn/spark fix_driver_oom

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11779.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11779


commit f1f6df685b416b1f7fa47f9419dc5a0300e63070
Author: jeanlyn <jeanly...@gmail.com>
Date:   2016-03-12T08:01:39Z

using onBlockUpdated to replace onTaskEnd when the block updated

commit 5255067c40b49efab597dfbb13ccfa9b6f7b0833
Author: jeanlyn <jeanly...@gmail.com>
Date:   2016-03-12T12:52:04Z

fix unit test

commit ab4a863b71ee28791b43daea1af58453906af8c0
Author: jeanlyn <jeanly...@gmail.com>
Date:   2016-03-15T01:01:53Z

fix scala style

commit 59150927dfe7c1eacd244a359fc5e8d040d4dc07
Author: jeanlyn <jeanly...@gmail.com>
Date:   2016-03-17T06:24:01Z

fix HistoryServerSuite

commit 2e2c319cd0dafe3ee8ae8d4fb8fe958ae5ddcd4e
Author: jeanlyn <jeanly...@gmail.com>
Date:   2016-03-17T09:16:39Z

fix typos







[GitHub] spark pull request: [SPARK-13845][CORE]Using onBlockUpdated to rep...

2016-03-19 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/11779#issuecomment-197786170
  
This PR is the same as #11679, but I ran into some problems while rebasing that PR, so I created a new one.
/cc @andrewor14 





[GitHub] spark pull request: [SPARK-13845][CORE]Using onBlockUpdated to rep...

2016-03-19 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/11679#issuecomment-197674341
  
All the test failures are related to `HistoryServerSuite`. The reason is that we removed `onTaskEnd`, which is used to replay the history server's storage page from the event log, and it seems the `onBlockUpdated` event is not logged to the file.
See: https://github.com/apache/spark/blob/184085284185011d7cc6d054b54d2d38eaf1dd77/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L196
Hence, is it acceptable to regenerate the `*_expectation.json` files in `src/test/resources/HistoryServerExpectations/` to pass the unit tests for now?
@andrewor14 





[GitHub] spark pull request: [SPARK-13845][CORE]Using onBlockUpdated to rep...

2016-03-19 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/11679#issuecomment-197778717
  
Closed this by accident.





[GitHub] spark pull request: [SPARK-13845][CORE]Using onBlockUpdated to rep...

2016-03-18 Thread jeanlyn
Github user jeanlyn closed the pull request at:

https://github.com/apache/spark/pull/11679





[GitHub] spark pull request: [SPARK-13845][CORE]Using onBlockUpdated to rep...

2016-03-15 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/11679#issuecomment-197127323
  
I think the problem is that the `updatedBlockStatuses` metric is not updated with code like
```
c.taskMetrics().incUpdatedBlockStatuses(Seq((blockId, status)))
```
in `BlockManager.removeBlock`, nor in the methods that invoke `removeBlock` when removing blocks.
See:

master: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1226-1226

branch-1.6: https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1108

branch-1.5: https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1101

So when a task ends, we cannot update `executorIdToStorageStatus` in `StorageStatusListener` correctly. Let me know if my understanding is wrong; forgive me for not explaining the problem clearly earlier.





[GitHub] spark pull request: [SPARK-13845][CORE]Using onBlockUpdated to rep...

2016-03-14 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/11679#issuecomment-196617986
  
@andrewor14 It seems the MiMa failure is not related to this patch. Do I need to fix it in this patch?





[GitHub] spark pull request: [SPARK-13845][CORE]Using onBlockUpdated to rep...

2016-03-14 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/11679#issuecomment-196591027
  
Thanks @andrewor14 for the review. We encountered this issue on branch-1.5, and I had noticed the recent changes to the metrics. If I understand correctly, the root cause of this issue is that `metrics.updatedBlockStatuses` is not updated in `BlockManager.removeBlock` when removing blocks, but I prefer to use `onBlockUpdated` to solve the issue.





[GitHub] spark pull request: [CORE]Using onBlockUpdated to replace onTaskEn...

2016-03-12 Thread jeanlyn
GitHub user jeanlyn opened a pull request:

https://github.com/apache/spark/pull/11679

[CORE]Using onBlockUpdated to replace onTaskEnd avoiding driver OOM

## What changes were proposed in this pull request?

We have a streaming job using `FlumePollInputStream` whose driver always OOMs after a few days; here is part of a driver heap dump taken before the OOM:
```
 num #instances #bytes  class name
--
   1:  13845916  553836640  org.apache.spark.storage.BlockStatus
   2:  14020324  336487776  org.apache.spark.storage.StreamBlockId
   3:  13883881  333213144  scala.collection.mutable.DefaultEntry
   4:  8907   89043952  [Lscala.collection.mutable.HashEntry;
   5: 62360   65107352  [B
   6:163368   24453904  [Ljava.lang.Object;
   7:293651   20342664  [C

```
`BlockStatus` and `StreamBlockId` keep growing, and the driver eventually OOMs.
After investigating, I found that `executorIdToStorageStatus` in `StorageStatusListener` seemingly never removes blocks from `StorageStatus`.
To fix the issue, I use `onBlockUpdated` to replace `onTaskEnd`, so we can update block information (adding blocks, dropping blocks from memory to disk, and deleting blocks) in time.


## How was this patch tested?

Existing unit tests and manual tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jeanlyn/spark fix_driver_oom

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11679.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11679


commit f1f6df685b416b1f7fa47f9419dc5a0300e63070
Author: jeanlyn <jeanly...@gmail.com>
Date:   2016-03-12T08:01:39Z

using onBlockUpdated to replace onTaskEnd when the block updated

commit 5255067c40b49efab597dfbb13ccfa9b6f7b0833
Author: jeanlyn <jeanly...@gmail.com>
Date:   2016-03-12T12:52:04Z

fix unit test







[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

2016-03-01 Thread jeanlyn
Github user jeanlyn closed the pull request at:

https://github.com/apache/spark/pull/11440





[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

2016-03-01 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/11440#issuecomment-190994425
  
Thanks @jerryshao, @srowen, and @zsxwing for the suggestions. I am closing this PR.





[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

2016-03-01 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/11440#issuecomment-190613342
  
My bad. I will try to figure out how to fix the case where window operations are used with the config set to true.





[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

2016-03-01 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/11440#issuecomment-190608101
  
@jerryshao Thanks for the explanation. I see what you mean. It only happens at the beginning, and if the stop time is much longer than the window time, I think it is acceptable to skip those down-time batches.





[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

2016-02-29 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/11440#issuecomment-190568465
  
Thanks @jerryshao for the suggestions!
> Jobs generated in the down time can be used for WAL replay, did you test 
when these down jobs are removed, the behavior of WAL replay is still correct?

It seems that `pendingTimes` is used for WAL replay; I do not skip these batches.

> Also for some windowing operations, I think this removal of down time 
jobs may possibly lead to the inconsistent result of windowing aggregation.

Does "inconsistent result" mean a wrong result?

Also, I will run the unit tests with the config set to true by default on my local machine.








[GitHub] spark pull request: [SPARK-13586]add config to skip generate down ...

2016-02-29 Thread jeanlyn
GitHub user jeanlyn opened a pull request:

https://github.com/apache/spark/pull/11440

[SPARK-13586]Add config to skip generating down time batches when restarting StreamingContext

## What changes were proposed in this pull request?

The patch adds a config, `spark.streaming.skipDownTimeBatch`, to control whether the down-time batches are generated when restarting a StreamingContext. By default, it is set to false.
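
For illustration, opting in to the proposed behavior would look roughly like this sketch (the config key comes from this PR and is not part of released Spark):

```scala
import org.apache.spark.SparkConf

// Sketch only: enable skipping the down-time batches that would otherwise be
// generated when a StreamingContext is restarted from a checkpoint.
val conf = new SparkConf()
  .setAppName("streaming-app")
  .set("spark.streaming.skipDownTimeBatch", "true") // proposed here; defaults to false
```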


## How was this patch tested?

unit test: test("SPARK-13586: do no generate down time batch when 
recovering from checkpoint") 



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jeanlyn/spark skipDownTime

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11440.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11440


commit 089d0af74317b767378c16673ac9d67f6dfd9972
Author: jeanlyn <jeanly...@gmail.com>
Date:   2016-02-29T07:15:38Z

add config to generate down time batch

commit 9068881aeb17bb77383b3e9eecf01c463f62113c
Author: jeanlyn <jeanly...@gmail.com>
Date:   2016-03-01T03:21:23Z

add jira num







[GitHub] spark pull request: [SPARK-13356][Streaming]WebUI missing input in...

2016-02-16 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/11228#issuecomment-185001455
  
@JoshRosen Sure.





[GitHub] spark pull request: [SPARK-13356][Streaming]WebUI missing input in...

2016-02-16 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/11228#issuecomment-184981662
  
@tdas @zsxwing 





[GitHub] spark pull request: [SPARK-13356][Streaming]WebUI missing input in...

2016-02-16 Thread jeanlyn
GitHub user jeanlyn opened a pull request:

https://github.com/apache/spark/pull/11228

[SPARK-13356][Streaming]WebUI missing input information when recovering from driver failure

Issue link:[SPARK-13356](https://issues.apache.org/jira/browse/SPARK-13356)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jeanlyn/spark checkpoint

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11228.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11228


commit c858d8398d22b158f0e36bdeeefb2d6941efd6da
Author: jeanlyn <jeanly...@gmail.com>
Date:   2016-02-16T15:06:13Z

report info when restarting streamingContext







[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-06 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-169524471
  
@marmbrus You are right. But I think @zhonghaihua's solution tries to reduce the possibility of a Cartesian product, right?





[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2016-01-06 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-169563822
  
It's different from join selection; it just pulls the nondeterministic expressions of the join condition out into the left or right children, but it seems it can reuse the code of `ExtractEquiJoinKeys`.





[GitHub] spark pull request: [SPARK-12125][SQL] pull out nondeterministic e...

2015-12-03 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/10128#issuecomment-161851642
  
@cloud-fan I think your case is different from @zhonghaihua's. The SQL only deals with certain join keys ('' and null) before the shuffle, to handle those meaningless keys that cause skew in the join operator, while `repartition` deals with all the data before a map operator. These are two different types of data skew, right?





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-09-02 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/7927#issuecomment-137046713
  
It seems that the failure is not related.





[GitHub] spark pull request: [SPARK-9192][SQL] add initialization phase for...

2015-08-31 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/7535#discussion_r38289221
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -129,6 +128,14 @@ trait CheckAnalysis {
 failAnalysis(
   s"unresolved operator ${operator.simpleString}")
 
+  case o if o.expressions.exists(!_.deterministic) &&
+!o.isInstanceOf[Project] && !o.isInstanceOf[Filter] =>
+failAnalysis(
--- End diff --

Hi @cloud-fan, can this support the join operation? Sometimes we use a non-deterministic expression to evaluate meaningless join keys (with respect to business logic) to avoid data skew.
For example:
```sql
SELECT src.key, src.value, src1.value
FROM src
JOIN src1
ON UPPER((CASE WHEN (src.key IS NULL OR src.key = '') THEN CAST((-RAND() * 1000) AS string) ELSE src.key END)) = UPPER(src1.key)
```
What do you think?





[GitHub] spark pull request: [SPARK-9192][SQL] add initialization phase for...

2015-08-31 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/7535#discussion_r38332979
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -129,6 +128,14 @@ trait CheckAnalysis {
 failAnalysis(
   s"unresolved operator ${operator.simpleString}")
 
+  case o if o.expressions.exists(!_.deterministic) &&
+!o.isInstanceOf[Project] && !o.isInstanceOf[Filter] =>
+failAnalysis(
--- End diff --

Yep, we would need to change this SQL manually after this behavior change.





[GitHub] spark pull request: [SPARK-10198][SQL] Turn off partition verifica...

2015-08-25 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/8404#discussion_r37867722
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
@@ -18,50 +18,54 @@
 package org.apache.spark.sql.hive
 
 import com.google.common.io.Files
+import org.apache.spark.sql.test.SQLTestUtils
 
 import org.apache.spark.sql.{QueryTest, _}
 import org.apache.spark.util.Utils
 
 
-class QueryPartitionSuite extends QueryTest {
+class QueryPartitionSuite extends QueryTest with SQLTestUtils {
 
   private lazy val ctx = org.apache.spark.sql.hive.test.TestHive
   import ctx.implicits._
-  import ctx.sql
+
+  protected def _sqlContext = ctx
 
    test("SPARK-5068: query data when path doesn't exist") {
-     val testData = ctx.sparkContext.parallelize(
-       (1 to 10).map(i => TestData(i, i.toString))).toDF()
-     testData.registerTempTable("testData")
+     withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
--- End diff --

Should be set to `true`?





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-09 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36587468
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -590,10 +590,24 @@ private[spark] class BlockManager(
  private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): Option[Any] = {
    require(blockId != null, "BlockId is null")
    val locations = Random.shuffle(master.getLocations(blockId))
+   var attemptTimes = 0
    for (loc <- locations) {
      logDebug(s"Getting remote block $blockId from $loc")
-     val data = blockTransferService.fetchBlockSync(
-       loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+     val data = try {
+       blockTransferService.fetchBlockSync(
+         loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+     } catch {
+       case t: Throwable if attemptTimes < locations.size - 1 =>
+         // Return null when Exception throw, so we can fetch block
+         // from another location if there still have locations
+         attemptTimes += 1
+         logWarning(s"Try $attemptTimes times getting remote block $blockId from $loc failed.", t)
+         null
+       case t: Throwable =>
+         // Throw BlockFetchException wraps the last Exception when
+         // there is no block we can fetch
+         throw new BlockFetchException(t)
--- End diff --

Sure!





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-05 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36294518
  
--- Diff: 
core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -443,6 +448,37 @@ class BlockManagerSuite extends SparkFunSuite with 
Matchers with BeforeAndAfterE
 assert(list2DiskGet.get.readMethod === DataReadMethod.Disk)
   }
 
+  test("(SPARK-9591)getRemoteBytes from another location when IOException throw") {
+    try {
+      conf.set("spark.network.timeout", "2s")
+      store = makeBlockManager(8000, "excutor1")
--- End diff --

My bad. 





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-05 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36302564
  
--- Diff: 
core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -443,6 +448,34 @@ class BlockManagerSuite extends SparkFunSuite with 
Matchers with BeforeAndAfterE
 assert(list2DiskGet.get.readMethod === DataReadMethod.Disk)
   }
 
+  test("(SPARK-9591)getRemoteBytes from another location when IOException throw") {
+    try {
+      conf.set("spark.network.timeout", "2s")
+      store = makeBlockManager(8000, "executor1")
+      store2 = makeBlockManager(8000, "executor2")
+      store3 = makeBlockManager(8000, "executor3")
+      val list1 = List(new Array[Byte](4000))
+      store2.putIterator("list1", list1.iterator, StorageLevel.MEMORY_ONLY, tellMaster = true)
+      store3.putIterator("list1", list1.iterator, StorageLevel.MEMORY_ONLY, tellMaster = true)
+      var list1Get = store.getRemoteBytes("list1")
+      assert(list1Get.isDefined, "list1Get expected to be fetched")
+      // block manager exit
+      store2.stop()
+      store2 = null
+      list1Get = store.getRemoteBytes("list1")
+      // get `list1` block
+      assert(list1Get.isDefined, "list1Get expected to be fetched")
+      store3.stop()
+      store3 = null
+      // exception throw because there is no locations
+      intercept[java.io.IOException] {
+        list1Get = store.getRemoteBytes("list1")
+      }
--- End diff --

Yes, the test passes on my local laptop. I only catch `IOException` when there are still locations left from which we can fetch the block.





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36182323
  
--- Diff: 
core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala ---
@@ -443,6 +443,21 @@ class BlockManagerSuite extends SparkFunSuite with 
Matchers with BeforeAndAfterE
 assert(list2DiskGet.get.readMethod === DataReadMethod.Disk)
   }
 
+  test("block manager crash test") {
+    conf.set("spark.network.timeout", "5s")
+    store = makeBlockManager(8000)
+    store2 = makeBlockManager(8000)
+    val list1 = List(new Array[Byte](4000))
+    store2.putIterator("list1", list1.iterator, StorageLevel.MEMORY_ONLY, tellMaster = true)
+    val list1get = store.getRemote("list1")
--- End diff --

Thanks, I will fix it.





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36182313
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -592,8 +592,14 @@ private[spark] class BlockManager(
    val locations = Random.shuffle(master.getLocations(blockId))
    for (loc <- locations) {
      logDebug(s"Getting remote block $blockId from $loc")
-     val data = blockTransferService.fetchBlockSync(
-       loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+     val data = try {
+       blockTransferService.fetchBlockSync(
+         loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+     } catch {
+       case e: Throwable =>
+         logWarning(s"Exception during getting remote block $blockId from $loc", e)
--- End diff --

Thanks @srowen and @CodingCat for the comments!
* If I understand correctly, `doGetRemote` here returns `None` when fetching the block from every location came back null, and every method that calls `doGetRemote` handles the `None` case and throws an exception when necessary, so I think it is safe to catch the exception here.
* `fetchBlockSync` just calls `fetchBlocks` to fetch the block, so I think catching the exception here is equivalent.





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread jeanlyn
GitHub user jeanlyn opened a pull request:

https://github.com/apache/spark/pull/7927

[SPARK-9591][CORE]Job may fail due to an exception while getting a broadcast variable

[SPARK-9591](https://issues.apache.org/jira/browse/SPARK-9591)
When getting a broadcast variable, we can fetch the block from several locations, but currently connecting to a lost block manager (for example, one that was idle long enough to be removed by the driver when dynamic resource allocation is used) causes the task to fail, and in the worst case the job fails.
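
A minimal sketch of the idea, separate from the actual `BlockManager.doGetRemote` change quoted elsewhere in this thread; the `fetchFrom` function is a hypothetical stand-in for the real fetch call:

```scala
import java.nio.ByteBuffer
import scala.util.{Failure, Success, Try}

// Sketch only: try each known location in turn and only fail once every
// location has been tried, instead of letting one lost block manager kill the task.
def fetchWithFallback(
    locations: Seq[String],
    fetchFrom: String => ByteBuffer): Option[ByteBuffer] = {
  val it = locations.iterator
  while (it.hasNext) {
    val loc = it.next()
    Try(fetchFrom(loc)) match {
      case Success(bytes) => return Some(bytes)
      case Failure(e) if it.hasNext =>
        // This location is gone (e.g. its executor was removed); try the next one.
        println(s"Fetching block from $loc failed, trying next location: $e")
      case Failure(e) =>
        throw e // no locations left, surface the last error
    }
  }
  None // no locations known for this block
}
```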

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jeanlyn/spark catch_exception

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/7927.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #7927


commit f8340a2ec880da41a36e3ae8bf87c5ac5badc919
Author: jeanlyn <jeanly...@gmail.com>
Date:   2015-08-04T09:23:29Z

catch exception avoid task fail







[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36217402
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -592,8 +592,14 @@ private[spark] class BlockManager(
 val locations = Random.shuffle(master.getLocations(blockId))
 for (loc - locations) {
   logDebug(sGetting remote block $blockId from $loc)
-  val data = blockTransferService.fetchBlockSync(
-loc.host, loc.port, loc.executorId, 
blockId.toString).nioByteBuffer()
+  val data = try {
+blockTransferService.fetchBlockSync(
+  loc.host, loc.port, loc.executorId, 
blockId.toString).nioByteBuffer()
+  } catch {
+case e: Throwable =
+  logWarning(sException during getting remote block $blockId from 
$loc, e)
--- End diff --

Makes sense, I will fix it later. Thanks a lot, @squito!





[GitHub] spark pull request: [SPARK-7165] [SQL] use sort merge join for out...

2015-06-21 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/5717#discussion_r32891269
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoin.scala
 ---
@@ -82,86 +130,169 @@ case class SortMergeJoin(
 
 override final def next(): InternalRow = {
   if (hasNext) {
-// we are using the buffered right rows and run down left 
iterator
-val joinedRow = joinRow(leftElement, 
rightMatches(rightPosition))
-rightPosition += 1
-if (rightPosition >= rightMatches.size) {
-  rightPosition = 0
-  fetchLeft()
-  if (leftElement == null || keyOrdering.compare(leftKey, 
matchKey) != 0) {
-stop = false
-rightMatches = null
+if (bufferedMatches == null || bufferedMatches.size == 0) {
+  // we just found a row with no join match and we are here to 
produce a row
+  // with this row and a standard null row from the other side.
+  if (continueStreamed) {
+val joinedRow = smartJoinRow(streamedElement, 
bufferedNullRow)
+fetchStreamed()
+joinedRow
+  } else {
+val joinedRow = smartJoinRow(streamedNullRow, 
bufferedElement)
+fetchBuffered()
+joinedRow
+  }
+} else {
+  // we are using the buffered right rows and run down left 
iterator
+  val joinedRow = smartJoinRow(streamedElement, 
bufferedMatches(bufferedPosition))
+  bufferedPosition += 1
+  if (bufferedPosition >= bufferedMatches.size) {
+bufferedPosition = 0
+if (joinType != FullOuter || secondStreamedElement == 
null) {
+  fetchStreamed()
--- End diff --

I think we should use `boundCondition` to update `bufferedMatches` after we call `fetchStreamed()`. Otherwise we may get a wrong answer. For example:
```sql
table a(key int,value int);table b(key int,value int)
data of a
1 3
1 1
2 1
2 3

data of b
1 1
2 1
select a.key,b.key,a.value-b.value from a left outer join b on a.key=b.key 
and a.value - b.value  1
```
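To make the point concrete, here is a tiny self-contained sketch (hypothetical names, not the actual `SortMergeJoin` internals, and it assumes the comparison dropped from the query text above is `>`): after each `fetchStreamed()`, the buffered rows that share the join key must be re-filtered with `boundCondition` before being reused.

```scala
// Hypothetical sketch: a row is just (key, value); boundCondition is the non-equi predicate.
def refreshMatches(
    streamed: (Int, Int),
    sameKeyBuffered: Seq[(Int, Int)],
    boundCondition: ((Int, Int), (Int, Int)) => Boolean): Seq[(Int, Int)] =
  sameKeyBuffered.filter(buffered => boundCondition(streamed, buffered))

// With the data above, re-filtering after advancing the streamed side keeps (1, 3)
// matched but sends (1, 1) down the null-padded output path.
val cond = (a: (Int, Int), b: (Int, Int)) => a._2 - b._2 > 1
refreshMatches((1, 3), Seq((1, 1)), cond) // Seq((1,1)) -> joined row
refreshMatches((1, 1), Seq((1, 1)), cond) // Seq()      -> left row padded with nulls
```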





[GitHub] spark pull request: [SPARK-7165] [SQL] use sort merge join for out...

2015-06-21 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/5717#discussion_r32891271
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
@@ -90,13 +90,12 @@ private[sql] abstract class SparkStrategies extends 
QueryPlanner[SparkPlan] {
left.statistics.sizeInBytes = 
sqlContext.conf.autoBroadcastJoinThreshold =
   makeBroadcastHashJoin(leftKeys, rightKeys, left, right, 
condition, joins.BuildLeft)
 
-  // If the sort merge join option is set, we want to use sort merge 
join prior to hashjoin
-  // for now let's support inner join first, then add outer join
-  case ExtractEquiJoinKeys(Inner, leftKeys, rightKeys, condition, 
left, right)
+  // If the sort merge join option is set, we want to use sort merge 
join prior to hashjoin.
+  // And for outer join, we can not put conditions outside of the join
+  case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, 
left, right)
 if sqlContext.conf.sortMergeJoinEnabled =
-val mergeJoin =
-  joins.SortMergeJoin(leftKeys, rightKeys, planLater(left), 
planLater(right))
-condition.map(Filter(_, mergeJoin)).getOrElse(mergeJoin) :: Nil
+joins.SortMergeJoin(
+  leftKeys, rightKeys, joinType, planLater(left), 
planLater(right), condition) :: Nil
 
--- End diff --

Shall we move this code into a new `Strategy` (e.g. a dedicated SortMergeJoin strategy) instead of 
mixing it into `HashJoin`?





[GitHub] spark pull request: [SPARK-7165] [SQL] use sort merge join for out...

2015-06-21 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/5717#discussion_r32891298
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoin.scala
 ---
@@ -82,86 +130,169 @@ case class SortMergeJoin(
 
 override final def next(): InternalRow = {
   if (hasNext) {
-// we are using the buffered right rows and run down left 
iterator
-val joinedRow = joinRow(leftElement, 
rightMatches(rightPosition))
-rightPosition += 1
-if (rightPosition >= rightMatches.size) {
-  rightPosition = 0
-  fetchLeft()
-  if (leftElement == null || keyOrdering.compare(leftKey, 
matchKey) != 0) {
-stop = false
-rightMatches = null
+if (bufferedMatches == null || bufferedMatches.size == 0) {
+  // we just found a row with no join match and we are here to 
produce a row
+  // with this row and a standard null row from the other side.
+  if (continueStreamed) {
+val joinedRow = smartJoinRow(streamedElement, 
bufferedNullRow)
+fetchStreamed()
+joinedRow
+  } else {
+val joinedRow = smartJoinRow(streamedNullRow, 
bufferedElement)
+fetchBuffered()
+joinedRow
+  }
+} else {
+  // we are using the buffered right rows and run down left 
iterator
+  val joinedRow = smartJoinRow(streamedElement, 
bufferedMatches(bufferedPosition))
+  bufferedPosition += 1
+  if (bufferedPosition >= bufferedMatches.size) {
+bufferedPosition = 0
+if (joinType != FullOuter || secondStreamedElement == 
null) {
+  fetchStreamed()
+  if (streamedElement == null || 
keyOrdering.compare(streamedKey, matchKey) != 0) {
+stop = false
+bufferedMatches = null
+  }
+} else {
+  // in FullOuter join and the first time we finish the 
match buffer,
+  // we still want to generate all rows with streamed null 
row and buffered
+  // rows that match the join key but not the conditions.
+  streamedElement = secondStreamedElement
+  bufferedMatches = secondBufferedMatches
+  secondStreamedElement = null
+  secondBufferedMatches = null
+}
   }
+  joinedRow
 }
-joinedRow
   } else {
 // no more result
 throw new NoSuchElementException
   }
 }
 
-private def fetchLeft() = {
-  if (leftIter.hasNext) {
-leftElement = leftIter.next()
-leftKey = leftKeyGenerator(leftElement)
+private def smartJoinRow(streamedRow: InternalRow, bufferedRow: 
InternalRow): InternalRow =
+  joinType match {
+case RightOuter => joinRow(bufferedRow, streamedRow)
+case _ => joinRow(streamedRow, bufferedRow)
+  }
+
+private def fetchStreamed(): Unit = {
+  if (streamedIter.hasNext) {
+streamedElement = streamedIter.next()
+streamedKey = streamedKeyGenerator(streamedElement)
   } else {
-leftElement = null
+streamedElement = null
   }
 }
 
-private def fetchRight() = {
-  if (rightIter.hasNext) {
-rightElement = rightIter.next()
-rightKey = rightKeyGenerator(rightElement)
+private def fetchBuffered(): Unit = {
+  if (bufferedIter.hasNext) {
+bufferedElement = bufferedIter.next()
+bufferedKey = bufferedKeyGenerator(bufferedElement)
   } else {
-rightElement = null
+bufferedElement = null
   }
 }
 
 private def initialize() = {
-  fetchLeft()
-  fetchRight()
+  fetchStreamed()
+  fetchBuffered()
 }
 
 /**
  * Searches the right iterator for the next rows that have matches 
in left side, and store
  * them in a buffer.
+ * When this is not a Inner join, we will also return true when we 
get a row with no match
+ * on the other side. This search will jump out every time from 
the same position until
+ * `next()` is called

[GitHub] spark pull request: [SPARK-8379][SQL]avoid speculative tasks write...

2015-06-16 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/6833#discussion_r32592438
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala ---
@@ -230,7 +230,15 @@ private[spark] class 
SparkHiveDynamicPartitionWriterContainer(
   val path = {
 val outputPath = FileOutputFormat.getOutputPath(conf.value)
--- End diff --

Oh, I will try it later.





[GitHub] spark pull request: [SPARK-8379][SQL]avoid speculative tasks write...

2015-06-16 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/6833#issuecomment-112636409
  
@chenghao-intel, I think it only affects the dynamic partition case, because `SparkHadoopWriter` 
gets its writer via `OutputFormat.getRecordWriter`, and most of those implementations use 
`FileOutputFormat.getTaskOutputPath` to obtain the output path.





[GitHub] spark pull request: [SPARK-8379][SQL]avoid speculative tasks write...

2015-06-15 Thread jeanlyn
GitHub user jeanlyn opened a pull request:

https://github.com/apache/spark/pull/6833

[SPARK-8379][SQL]avoid speculative tasks write to the same file

The issue link 
[SPARK-8379](https://issues.apache.org/jira/browse/SPARK-8379)
Currently, when we insert data into dynamic partitions with speculative tasks enabled, we get the 
exception
```

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 
Lease mismatch on 
/tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-1/ds=2015-06-15/type=2/part-00301.lzo
 
owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53 
but is accessed by 
DFSClient_attempt_201506031520_0011_m_42_0_-1275047721_57
```
This PR tries to write the data to a temporary directory when using dynamic partitioning, so that 
speculative tasks do not write to the same file.
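A rough, self-contained sketch of the idea (the helper below is hypothetical and only illustrates the directory layout; it is not the code in `hiveWriterContainers.scala`, and it assumes hadoop-common on the classpath): each task attempt writes its dynamic-partition output under an attempt-specific temporary directory, so two speculative attempts of the same task can never open the same file.

```scala
import org.apache.hadoop.fs.Path

// Build a path like <output>/_temporary/attempt_<task>_<attempt>/<partition dirs>/part-<task>
// so that speculative attempts of the same task write to different files.
def attemptScopedOutput(
    outputPath: Path,
    dynamicPartitionPath: String, // e.g. "ds=2015-06-15/type=2"
    taskId: Int,
    attemptId: Int): Path = {
  val attemptDir = new Path(new Path(outputPath, "_temporary"), s"attempt_${taskId}_$attemptId")
  new Path(new Path(attemptDir, dynamicPartitionPath), f"part-$taskId%05d")
}
```

On success, the committed attempt's directory would then be moved into the final partition location and the temporary directories of the losing attempts discarded.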

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jeanlyn/spark speculation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/6833.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #6833


commit e19a3bd77b6b9f44479e51659e244e9809b2963d
Author: jeanlyn jeanly...@gmail.com
Date:   2015-06-15T16:38:16Z

avoid speculative tasks write same file







[GitHub] spark pull request: [SPARK-8379][SQL]avoid speculative tasks write...

2015-06-15 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/6833#discussion_r32492419
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -197,7 +197,6 @@ case class InsertIntoHiveTable(
   table.hiveQlTable.getPartCols().foreach { entry =
 orderedPartitionSpec.put(entry.getName, 
partitionSpec.get(entry.getName).getOrElse())
   }
-  val partVals = 
MetaStoreUtils.getPvals(table.hiveQlTable.getPartCols, partitionSpec)
 
--- End diff --

I think
https://github.com/apache/spark/pull/5876/files#diff-d579db9a8f27e0bbef37720ab14ec3f6L203
should have removed this code. @marmbrus, right?





[GitHub] spark pull request: [SPARK-8379][SQL]avoid speculative tasks write...

2015-06-15 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/6833#discussion_r32491951
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -197,7 +197,6 @@ case class InsertIntoHiveTable(
   table.hiveQlTable.getPartCols().foreach { entry =
 orderedPartitionSpec.put(entry.getName, 
partitionSpec.get(entry.getName).getOrElse())
   }
-  val partVals = 
MetaStoreUtils.getPvals(table.hiveQlTable.getPartCols, partitionSpec)
 
--- End diff --

This code seems to be unused, so I removed it.





[GitHub] spark pull request: [SPARK-2205][SPARK-7871][SQL]Advoid redundancy...

2015-06-12 Thread jeanlyn
Github user jeanlyn closed the pull request at:

https://github.com/apache/spark/pull/6682





[GitHub] spark pull request: [SPARK-2205][SPARK-7871][SQL]Advoid redundancy...

2015-06-07 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/6682#issuecomment-109777121
  
@yhuai, thanks for the comment. The current implementation of `join` (`BinaryNode`) in master 
simply uses one side's partitioning as its own to decide whether a shuffle is needed, and ignores 
the other side's partitioning even when that side is already partitioned. This may cause 
unnecessary shuffles in multiway joins. For example:
```sql
table a(key string,value string)
table b(key string,value string)
table c(key string,value string)
table d(key string,value string)
table e(key string,value string)

select a.value,b.value,c.value,d.value,e.value from
a join b 
on a.key = b.key
join c
on a.key = c.key
join d
on b.key = d.key
join e
on c.key = e.key
```
we got
```
Project [value#63,value#65,value#67,value#69,value#71]
 ShuffledHashJoin [key#66], [key#70], BuildRight
  Exchange (HashPartitioning [key#66], 200)
   Project [value#63,key#66,value#67,value#65,value#69]
ShuffledHashJoin [key#64], [key#68], BuildRight
 Exchange (HashPartitioning [key#64], 200)
  Project [value#63,key#66,key#64,value#67,value#65]
   ShuffledHashJoin [key#62], [key#66], BuildRight
ShuffledHashJoin [key#62], [key#64], BuildRight
 Exchange (HashPartitioning [key#62], 200)
  HiveTableScan [key#62,value#63], (MetastoreRelation default, a, 
None), None
 Exchange (HashPartitioning [key#64], 200)
  HiveTableScan [key#64,value#65], (MetastoreRelation default, b, 
None), None
Exchange (HashPartitioning [key#66], 200)
 HiveTableScan [key#66,value#67], (MetastoreRelation default, c, 
None), None
 Exchange (HashPartitioning [key#68], 200)
  HiveTableScan [key#68,value#69], (MetastoreRelation default, d, None),
```
But actually
we just need
```
Project [value#59,value#61,value#63,value#65,value#67]
 ShuffledHashJoin [key#62], [key#66], BuildRight
  Project [value#63,value#61,value#65,value#59,key#62]
   ShuffledHashJoin [key#60], [key#64], BuildRight
Project [value#63,value#61,key#60,value#59,key#62]
 ShuffledHashJoin [key#58], [key#62], BuildRight
  ShuffledHashJoin [key#58], [key#60], BuildRight
   Exchange (HashPartitioning 200)
HiveTableScan [key#58,value#59], (MetastoreRelation default, a, 
None), None
   Exchange (HashPartitioning 200)
HiveTableScan [key#60,value#61], (MetastoreRelation default, b, 
None), None
  Exchange (HashPartitioning 200)
   HiveTableScan [key#62,value#63], (MetastoreRelation default, c, 
None), None
Exchange (HashPartitioning 200)
 HiveTableScan [key#64,value#65], (MetastoreRelation default, d, None), 
None
  Exchange (HashPartitioning 200)
   HiveTableScan [key#66,value#67], (MetastoreRelation default, e, None), 
None
```
This will greatly improve efficiency, especially for outer joins. We have had some real-world 
cases of multiway full outer joins on the same key that produce a lot of null keys (causing data 
skew, which Hive does not suffer from), together with redundant shuffles, and finally ran out of 
memory.
I want to try using `meetPartitions`: when constructing the plan tree we can record both children's 
`outputPartitioning` of a `BinaryNode` along with the node's own `outputPartitioning` (redundant?), 
which makes this easy to achieve; `meetPartitions` is then reset to the node's own output 
partitioning whenever a shuffle is needed, so that an `Exchange` that is really required is not 
removed.




[GitHub] spark pull request: [SPARK-2205][SPARK-7871][SQL]Advoid redundancy...

2015-06-07 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/6682#issuecomment-109871137
  
@yhuai, yes, in the full outer join cases spark-sql shuffles the null keys to the same reducer, 
and the Hive plan is generated like this:
```sql
explain select a.value,b.value,c.value,d.value from
a full outer join b 
on a.key = b.key
full outer join c
on a.key = c.key
full outer join d
on a.key = d.key


STAGE PLANS:
  Stage: Stage-1
Map Reduce
  Map Operator Tree:
  TableScan
alias: a
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
stats: NONE
Reduce Output Operator
  key expressions: key (type: string)
  sort order: +
  Map-reduce partition columns: key (type: string)
  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
stats: NONE
  value expressions: value (type: string)
  TableScan
alias: b
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
stats: NONE
Reduce Output Operator
  key expressions: key (type: string)
  sort order: +
  Map-reduce partition columns: key (type: string)
  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
stats: NONE
  value expressions: value (type: string)
  TableScan
alias: c
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
stats: NONE
Reduce Output Operator
  key expressions: key (type: string)
  sort order: +
  Map-reduce partition columns: key (type: string)
  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
stats: NONE
  value expressions: value (type: string)
  TableScan
alias: d
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
stats: NONE
Reduce Output Operator
  key expressions: key (type: string)
  sort order: +
  Map-reduce partition columns: key (type: string)
  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
stats: NONE
  value expressions: value (type: string)
  Reduce Operator Tree:
Join Operator
  condition map:
   Outer Join 0 to 1
   Outer Join 0 to 2
   Outer Join 0 to 3
  keys:
0 key (type: string)
1 key (type: string)
2 key (type: string)
3 key (type: string)
  outputColumnNames: _col1, _col6, _col11, _col16
  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
stats: NONE
  Select Operator
expressions: _col1 (type: string), _col6 (type: string), _col11 
(type: string), _col16 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
stats: NONE
File Output Operator
  compressed: false
  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
stats: NONE
  table:
  input format: org.apache.hadoop.mapred.TextInputFormat
  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
Fetch Operator
  limit: -1
  Processor Tree:
ListSink
```
@chenghao-intel has a solution in #6413





[GitHub] spark pull request: [SPARK-2205][SPARK-7871][SQL]Advoid redundancy...

2015-06-06 Thread jeanlyn
GitHub user jeanlyn opened a pull request:

https://github.com/apache/spark/pull/6682

[SPARK-2205][SPARK-7871][SQL]Advoid redundancy exchange

Using only the output partitioning of a `BinaryNode` will probably add unnecessary `Exchange` 
operators, for example in multiway joins.
This PR adds `meetPartitions` to SparkPlan to avoid redundant exchanges: it saves the 
partitionings of the node and its children, and is reset to the node's own partitioning when a 
shuffle is needed.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jeanlyn/spark addMeetRequire

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/6682.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #6682


commit 7e52db3e6ff6e2dffe0c599730b8019319fda185
Author: jeanlyn jeanly...@gmail.com
Date:   2015-06-06T10:22:57Z

remove unnecessary exchange







[GitHub] spark pull request: [SPARK-2205][SPARK-7871][SQL]Advoid redundancy...

2015-06-06 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/6682#issuecomment-109564743
  
cc @yhuai @chenghao-intel 





[GitHub] spark pull request: [SPARK-6908] [SQL] Use isolated Hive client

2015-06-01 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/5876#issuecomment-107415482
  
I set the metastore to `0.12.0` with the following steps, but get a NoClassDefFoundError:
* I changed `spark.sql.hive.metastore.version` in `spark-defaults.conf` to `0.12.0`.
* I set `spark.sql.hive.metastore.jars` to `maven`, or to the classpath of Hadoop and Hive, and got
```
java.lang.NoClassDefFoundError: com/google/common/base/Preconditions when creating Hive client using classpath
```
I am not sure I understand this correctly.
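For reference, the same settings expressed programmatically (both keys are real Spark SQL options; the values are simply the ones tried in the steps above, not recommendations):

```scala
import org.apache.spark.SparkConf

// Equivalent of the spark-defaults.conf entries described above.
val conf = new SparkConf()
  .set("spark.sql.hive.metastore.version", "0.12.0")
  .set("spark.sql.hive.metastore.jars", "maven") // or a classpath containing the Hadoop and Hive jars
```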





[GitHub] spark pull request: [SPARK-6908] [SQL] Use isolated Hive client

2015-06-01 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/5876#issuecomment-107459171
  
I can use the built-in classes (0.13.1) to connect to a `0.12.0` metastore correctly, except for 
some warnings and errors that do not affect the run:
```
5/06/01 21:20:09 WARN metastore.RetryingMetaStoreClient: MetaStoreClient 
lost connection. Attempting to reconnect.
org.apache.thrift.TApplicationException: Invalid method name: 
'get_functions'
   at 
org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
   at 
org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
   at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_functions(ThriftHiveMetastore.java:2886)
   at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_functions(ThriftHiveMetastore.java:2872)
   at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getFunctions(HiveMetaStoreClient.java:1727)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
   at $Proxy10.getFunctions(Unknown Source)
   at 
org.apache.hadoop.hive.ql.metadata.Hive.getFunctions(Hive.java:2670)
   at 
org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionNames(FunctionRegistry.java:674)
   at 
org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionNames(FunctionRegistry.java:662)
   at 
org.apache.hadoop.hive.cli.CliDriver.getCommandCompletor(CliDriver.java:540)
   at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:174)
   at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
   at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/06/01 21:20:10 INFO hive.metastore: Trying to connect to metastore with 
URI thrift://172.19.154.28:9084
15/06/01 21:20:10 INFO hive.metastore: Connected to metastore.
15/06/01 21:20:10 ERROR exec.FunctionRegistry: 
org.apache.hadoop.hive.ql.metadata.HiveException: 
org.apache.thrift.TApplicationException: Invalid method name: 'get_functions'
spark-sql
```
Thanks 





[GitHub] spark pull request: [SPARK-8020] Spark SQL in spark-defaults.conf ...

2015-06-01 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/6563#discussion_r31490129
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala ---
@@ -37,6 +38,48 @@ class VersionsSuite extends SparkFunSuite with Logging {
      "hive.metastore.warehouse.dir" -> warehousePath.toString)
  }
 
+  test("SPARK-8020: successfully create a HiveContext with metastore settings in Spark conf.") {
+    val sparkConf =
+      new SparkConf() {
+        // We are not really clone it. We need to keep the custom getAll.
+        override def clone: SparkConf = this
+
+        override def getAll: Array[(String, String)] = {
+          val allSettings = super.getAll
+          val metastoreVersion = get("spark.sql.hive.metastore.version")
+          val metastoreJars = get("spark.sql.hive.metastore.jars")
+
+          val others = allSettings.filterNot { case (key, _) =>
+            key == "spark.sql.hive.metastore.version" || key == "spark.sql.hive.metastore.jars"
+          }
+
+          // Put metastore.version to the first one. It is needed to trigger the exception
+          // caused by SPARK-8020. Other problems triggered by SPARK-8020
+          // (e.g. using Hive 0.13.1's metastore client to connect to a 0.12 metastore)
+          // are not easy to test.
+          Array(
+            ("spark.sql.hive.metastore.version" -> metastoreVersion),
+            ("spark.sql.hive.metastore.jars" -> metastoreJars)) ++ others
+        }
+      }
+    sparkConf
+      .set("spark.sql.hive.metastore.version", "12")
--- End diff --

Does `12` equate to `0.12.0` or `1.2`?





[GitHub] spark pull request: [SPARK-7885][SQL]add config to control map agg...

2015-05-30 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/6426#issuecomment-107059001
  
Thanks @rxin for the information. I will close this PR for now and reopen it once it has been 
optimized.





[GitHub] spark pull request: [SPARK-7885][SQL]add config to control map agg...

2015-05-30 Thread jeanlyn
Github user jeanlyn closed the pull request at:

https://github.com/apache/spark/pull/6426





[GitHub] spark pull request: [SPARK-7871] [SQL] Improve the outputPartition...

2015-05-27 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/6413#discussion_r31158220
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala ---
@@ -32,6 +32,26 @@ import org.apache.spark.sql.types._
 
 
 class PlannerSuite extends FunSuite {
+  test("multiway full outer join") {
+    val planned = testData
+      .join(testData2, testData("key") === testData2("a"), "outer")
+      .join(testData3, testData("key") === testData3("a"), "outer")
+      .queryExecution.executedPlan
+    val exchanges = planned.collect { case n: Exchange => n }
+
+    assert(exchanges.size === 3)
--- End diff --

Do these changes have any effect on
```scala
testData
  .join(testData2, testData("key") === testData2("a"), "outer")
  .join(testData2, testData("a") === testData3("a"), "outer")
```
?





[GitHub] spark pull request: [SPARK-7885][SQL]add config to control map agg...

2015-05-26 Thread jeanlyn
GitHub user jeanlyn opened a pull request:

https://github.com/apache/spark/pull/6426

[SPARK-7885][SQL]add config to control map aggregation in spark sql

[SPARK-7885](https://issues.apache.org/jira/browse/SPARK-7885): we add 
`spark.sql.partialAggregation.enable`, which is true by default; setting it to false disables 
map-side aggregation to avoid GC problems. For example, we run the SQL
```sql
insert overwrite table groupbytest
select sale_ord_id as order_id,
  coalesce(sum(sku_offer_amount),0.0) as sku_offer_amount,
  coalesce(sum(suit_offer_amount),0.0) as suit_offer_amount,
  coalesce(sum(flash_gp_offer_amount),0.0) + 
coalesce(sum(gp_offer_amount),0.0) as gp_offer_amount,
  coalesce(sum(flash_gp_offer_amount),0.0) as flash_gp_offer_amount,
  coalesce(sum(full_minus_offer_amount),0.0) as 
full_rebate_offer_amount,
  0.0 as telecom_point_offer_amount,
  coalesce(sum(coupon_pay_amount),0.0) as dq_and_jq_pay_amount,
  coalesce(sum(jq_pay_amount),0.0) + 
coalesce(sum(pop_shop_jq_pay_amount),0.0) + 
coalesce(sum(lim_cate_jq_pay_amount),0.0) as jq_pay_amount,
  coalesce(sum(dq_pay_amount),0.0) + 
coalesce(sum(pop_shop_dq_pay_amount),0.0) + 
coalesce(sum(lim_cate_dq_pay_amount),0.0) as dq_pay_amount,
  coalesce(sum(gift_cps_pay_amount),0.0) as gift_cps_pay_amount ,
  coalesce(sum(mobile_red_packet_pay_amount),0.0) as 
mobile_red_packet_pay_amount,
  coalesce(sum(acct_bal_pay_amount),0.0) as acct_bal_pay_amount,
  coalesce(sum(jbean_pay_amount),0.0) as jbean_pay_amount,
  coalesce(sum(sku_rebate_amount),0.0) as sku_rebate_amount,
  coalesce(sum(yixun_point_pay_amount),0.0) as yixun_point_pay_amount,
  coalesce(sum(sku_freight_coupon_amount),0.0) as freight_coupon_amount
fromord_at_det_di
where   ds = '2015-05-20'
group  by   sale_ord_id
```
Using 6 executors, each with 8 GB of memory and 2 CPUs, we hit GC problems during the map-side 
aggregation and the executors finally crashed:

![5869030a-d924-4249-9e1d-c637caa9363a](https://cloud.githubusercontent.com/assets/3426093/7828153/4afdaf88-0462-11e5-8af0-3bff04edab92.png)
 

When we set `spark.sql.partialAggregation.enable` to false, the SQL finishes in 2 minutes.
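For context, this is how the proposed flag would have been used; the config key only exists in this (now closed) PR, so the snippet is purely illustrative and assumes a running `SparkContext` named `sc`.

```scala
// Hypothetical usage of the flag proposed in this PR (never merged into Spark):
// disable map-side (partial) aggregation for large group-by jobs that hit GC trouble.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.setConf("spark.sql.partialAggregation.enable", "false")
sqlContext.sql("select sale_ord_id, sum(sku_offer_amount) from ord_at_det_di where ds = '2015-05-20' group by sale_ord_id")
```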

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jeanlyn/spark partialAggregation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/6426.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #6426


commit b17c676bb0d33019bbdd124048221595f278b9d0
Author: jeanlyn jeanly...@gmail.com
Date:   2015-05-27T03:03:47Z

add config to control map aggregation in spark sql







[GitHub] spark pull request: [SPARK-7885][SQL]add config to control map agg...

2015-05-26 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/6426#issuecomment-105759184
  
Thanks @JoshRosen @rxin for the comments. I have met at least two group-by cases in our production 
environment that ran with long GC times and finally crashed the executors. These cases share a 
common characteristic: the input splits are large (more than 128 MB) and the keys are also large, 
but there is no data skew, because when I turn `spark.sql.partialAggregation.enable` off we get 
the result quickly and each reduce task handles almost the same amount of data.
@rxin, do you have more information about the `refactor aggregate/UDAF interface` work? Thanks!





[GitHub] spark pull request: [SPARK-2926][Shuffle]Add MR style sort-merge s...

2015-05-13 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/3438#discussion_r30295775
  
--- Diff: 
core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleReader.scala ---
@@ -0,0 +1,337 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.shuffle.sort
+
+import java.io.File
+import java.io.FileOutputStream
+import java.nio.ByteBuffer
+import java.util.Comparator
+
+import scala.collection.mutable.{ArrayBuffer, HashMap, Queue}
+import scala.util.{Failure, Success, Try}
+
+import org.apache.spark._
+import org.apache.spark.executor.ShuffleWriteMetrics
+import org.apache.spark.network.buffer.{ManagedBuffer, NioManagedBuffer}
+import org.apache.spark.serializer.Serializer
+import org.apache.spark.shuffle.{BaseShuffleHandle, FetchFailedException, 
ShuffleReader}
+import org.apache.spark.storage._
+import org.apache.spark.util.{CompletionIterator, Utils}
+import org.apache.spark.util.collection.{MergeUtil, TieredDiskMerger}
+
+/**
+ * SortShuffleReader merges and aggregates shuffle data that has already 
been sorted within each
+ * map output block.
+ *
+ * As blocks are fetched, we store them in memory until we fail to acquire 
space from the
+ * ShuffleMemoryManager. When this occurs, we merge some in-memory blocks 
to disk and go back to
+ * fetching.
+ *
+ * TieredDiskMerger is responsible for managing the merged on-disk blocks 
and for supplying an
+ * iterator with their merged contents. The final iterator that is passed 
to user code merges this
+ * on-disk iterator with the in-memory blocks that have not yet been 
spilled.
+ */
+private[spark] class SortShuffleReader[K, C](
+handle: BaseShuffleHandle[K, _, C],
+startPartition: Int,
+endPartition: Int,
+context: TaskContext)
+  extends ShuffleReader[K, C] with Logging {
+
+  /** Manage the fetched in-memory shuffle block and related buffer */
+  case class MemoryShuffleBlock(blockId: BlockId, blockData: ManagedBuffer)
+
+  require(endPartition == startPartition + 1,
+    "Sort shuffle currently only supports fetching one partition")
+
+  private val dep = handle.dependency
+  private val conf = SparkEnv.get.conf
+  private val blockManager = SparkEnv.get.blockManager
+  private val ser = Serializer.getSerializer(dep.serializer)
+  private val shuffleMemoryManager = SparkEnv.get.shuffleMemoryManager
+
+  private val fileBufferSize = conf.getInt("spark.shuffle.file.buffer.kb", 32) * 1024
+
+  /** Queue to store in-memory shuffle blocks */
+  private val inMemoryBlocks = new Queue[MemoryShuffleBlock]()
+
+  /**
+   * Maintain block manager and reported size of each shuffle block. The 
block manager is used for
+   * error reporting. The reported size, which, because of size 
compression, may be slightly
+   * different than the size of the actual fetched block, is used for 
calculating how many blocks
+   * to spill.
+   */
+  private val shuffleBlockMap = new HashMap[ShuffleBlockId, 
(BlockManagerId, Long)]()
+
+  /** keyComparator for mergeSort, if keyOrdering is not available,
+    * using hashcode of key to compare */
+  private val keyComparator: Comparator[K] = dep.keyOrdering.getOrElse(new 
Comparator[K] {
+override def compare(a: K, b: K) = {
+  val h1 = if (a == null) 0 else a.hashCode()
+  val h2 = if (b == null) 0 else b.hashCode()
+  if (h1 < h2) -1 else if (h1 == h2) 0 else 1
+}
+  })
+
+  /** A merge thread to merge on-disk blocks */
+  private val tieredMerger = new TieredDiskMerger(conf, dep, 
keyComparator, context)
+
+  /** Shuffle block fetcher iterator */
+  private var shuffleRawBlockFetcherItr: ShuffleRawBlockFetcherIterator = _
+
+  /** Number of bytes spilled in memory and on disk */
+  private var _memoryBytesSpilled: Long = 0L
+  private var _diskBytesSpilled: Long = 0L

[GitHub] spark pull request: [SQL][Minor] typo

2015-03-26 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/5220#issuecomment-86787963
  
@chenghao-intel, I think #5198 has already fixed the problem.





[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...

2015-03-23 Thread jeanlyn
Github user jeanlyn closed the pull request at:

https://github.com/apache/spark/pull/5079





[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...

2015-03-23 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/5079#issuecomment-85085332
  
After communicating with @adrian-wang offline, I realized this PR still leaves some class loader 
problems, so I am closing this one.





[GitHub] spark pull request: [SPARK-5794] [SQL] [WIP] fix add jar

2015-03-23 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/4586#discussion_r26954661
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -76,7 +77,8 @@ class HadoopTableReader(
   override def makeRDDForTable(hiveTable: HiveTable): RDD[Row] =
 makeRDDForTable(
   hiveTable,
-  
relation.tableDesc.getDeserializerClass.asInstanceOf[Class[Deserializer]],
+  Class.forName(relation.tableDesc.getSerdeClassName, true, 
sc.sparkContext.getClassLoader)
--- End diff --

Shall we also do this in `makeRDDForPartitionedTable`?





[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...

2015-03-19 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/5079#issuecomment-83372419
  
@chenghao-intel my full code is
```java
import org.apache.hadoop.hive.ql.exec.UDF;

public class hello extends UDF {
public String evaluate(String str) {
try {
return "hello " + str;
} catch (Exception e) {
return null;
}
}
}
```
@adrian-wang, I also tested `mapjoin_addjar.q` in `spark-sql`.
I got this exception on `CREATE TABLE`:
```
15/03/19 14:41:36 ERROR DDLTask: java.lang.NoSuchFieldError: CHAR
at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:310)
at 
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:277)
at 
org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
at 
org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
```
But it does not seem to be a jar-loading problem, because when I do not run
```
add jar 
${system:maven.local.repository}/org/apache/hive/hcatalog/hive-hcatalog-core/${system:hive.version}/hive-hcatalog-core-${system:hive.version}.jar;
```
I get the following exception:
```
15/03/19 14:54:51 ERROR DDLTask: 
org.apache.hadoop.hive.ql.metadata.HiveException: Cannot validate serde: 
org.apache.hive.hcatalog.data.JsonSerDe
at 
org.apache.hadoop.hive.ql.exec.DDLTask.validateSerDe(DDLTask.java:3423)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3553)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:252)
```





[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...

2015-03-19 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/5079#issuecomment-83383402
  
I also don't have CHAR in `mapjoin_addjar.q`. I only found one 
`mapjoin_addjar.q`, and the path of my file is

sql/hive/src/test/resources/ql/src/test/queries/clientpositive/mapjoin_addjar.q
```sql
set hive.auto.convert.join=true;
set hive.auto.convert.join.use.nonstaged=false;

add jar 
${system:maven.local.repository}/org/apache/hive/hcatalog/hive-hcatalog-core/${system:hive.version}/hive-hcatalog-core-${system:hive.version}.jar;

CREATE TABLE t1 (a string, b string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
;
LOAD DATA LOCAL INPATH "../../data/files/sample.json" INTO TABLE t1;
select * from src join t1 on src.key =t1.a;
drop table t1;
set hive.auto.convert.join=false;

```
Maybe we can discuss this offline?





[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...

2015-03-18 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/5079#issuecomment-82908034
  
Updated. @liancheng @marmbrus, I have tried to add a test for this patch; could you take a look 
at the test?





[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...

2015-03-18 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/5079#issuecomment-82976538
  
Ok.





[GitHub] spark pull request: [SPARK-5498][SQL]fix query exception when part...

2015-03-18 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/4289#issuecomment-82926819
  
Hi @marmbrus, I have updated the code as you mentioned above.





[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...

2015-03-18 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/5079#issuecomment-82970359
  
Thanks @liancheng for the explanation. You are right, it needs more consideration. So, should I 
remove the test?





[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...

2015-03-18 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/5079#issuecomment-83254407
  
@yhuai, it is just a simple function:
```java
public String evaluate(String str) {
try {
return "hello " + str;
} catch (Exception e) {
return null;
}
}
```





[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...

2015-03-18 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/5079#issuecomment-83267913
  
@chenghao-intel, I am not clear on what problem #4586 tries to fix. If #4586 tries to fix the 
problem I mentioned, I think reusing `SparkContext.addJar` is enough to fix the class 
loading/unloading problem. Right?
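A minimal illustration of that suggestion (assuming an existing `SparkContext` named `sc`; the jar path is just the one from the example in this PR):

```scala
// SparkContext.addJar ships the jar to executors so that classes loaded from it
// (e.g. the UDF class) can be resolved when tasks are deserialized and run.
sc.addJar("/home/jeanlyn/hello.jar")
```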





[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...

2015-03-18 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/5079#issuecomment-83299775
  
@adrian-wang, you mean it does not work in `spark-shell`?





[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...

2015-03-18 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/5079#issuecomment-83348085
  
@adrian-wang, I have tested in `spark-sql` and get the correct result with my test case. Can you 
provide your test case? By the way, when I debugged this issue I found that the `thriftserver` 
mode also reuses `SparkContext.addJar`.





[GitHub] spark pull request: [SPARK-5794] [SQL] [WIP] fix add jar

2015-03-17 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/4586#discussion_r26595976
  
--- Diff: 
sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala
 ---
@@ -263,7 +263,8 @@ private[hive] class SparkSQLCLIDriver extends CliDriver 
with Logging {
   val proc: CommandProcessor = 
HiveShim.getCommandProcessor(Array(tokens(0)), hconf)
 
   if (proc != null) {
-if (proc.isInstanceOf[Driver] || proc.isInstanceOf[SetProcessor]) {
+if (proc.isInstanceOf[Driver] || proc.isInstanceOf[SetProcessor] ||
+  proc.isInstanceOf[AddResourceProcessor]) {
--- End diff --

Agreed.





[GitHub] spark pull request: [SPARK-6392][SQL]Minor fix ClassNotFound excep...

2015-03-17 Thread jeanlyn
GitHub user jeanlyn opened a pull request:

https://github.com/apache/spark/pull/5079

[SPARK-6392][SQL]Minor fix ClassNotFound exception when use spark cli to 
add jar 

When we use the Spark CLI to add a jar dynamically, we get a `java.lang.ClassNotFoundException` 
when we use a class from that jar to create a UDF. For example:
```sql
spark-sql> add jar /home/jeanlyn/hello.jar;
spark-sql> create temporary function hello as 'hello';
spark-sql> select hello(name) from person;
Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most 
recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): 
java.lang.ClassNotFoundException: hello
```
we can use the spark physical plan to fix this problem

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jeanlyn/spark SPARK-6392

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/5079.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5079


commit ca958493957e7cc5f49dc318bcbf213793a991c0
Author: jeanlyn jeanly...@gmail.com
Date:   2015-03-18T02:02:35Z

use spark physical plan when add jar







[GitHub] spark pull request: [SPARK-5498][SQL]fix query exception when part...

2015-03-17 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/4289#issuecomment-82383269
  
Updated, @marmbrus @chenghao-intel. We have been testing this patch in our 
environment over the past few days. Are there any more problems with this patch?


[GitHub] spark pull request: [SPARK-5068][SQL]fix bug query data when path ...

2015-02-20 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/3907#issuecomment-75260761
  
OK, I'll close this one.


[GitHub] spark pull request: [SPARK-5068][SQL]fix bug query data when path ...

2015-02-20 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/3891#issuecomment-75260856
  
OK, I'll close this one.


[GitHub] spark pull request: [SPARK-5068][SQL]fix bug query data when path ...

2015-02-20 Thread jeanlyn
Github user jeanlyn closed the pull request at:

https://github.com/apache/spark/pull/3891


[GitHub] spark pull request: [SPARK-5068][SQL]fix bug query data when path ...

2015-02-20 Thread jeanlyn
Github user jeanlyn closed the pull request at:

https://github.com/apache/spark/pull/3907


[GitHub] spark pull request: [SPARK-5498][SQL]fix query exception when part...

2015-02-11 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/4289#issuecomment-73868381
  
@chenghao-intel, I passed all the unit tests locally, but I think the 
thrift-server unit tests are unstable: they depend on the state of the 
machine, and when the machine is busy they may time out.


[GitHub] spark pull request: [SPARK-5498][SQL]fix query exception when part...

2015-02-11 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/4289#issuecomment-73877412
  
/cc @marmbrus


[GitHub] spark pull request: [SPARK-5498][SQL]fix query exception when part...

2015-02-10 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/4289#issuecomment-73837462
  
Hi @marmbrus, @chenghao-intel, I have no idea why the `SPARK-4407 regression: 
Complex type support` test failed after I resolved the merge conflicts. It does 
not appear to be caused by my change, because this unit test passed before.


[GitHub] spark pull request: [SPARK-5498][SQL]fix query exception when part...

2015-02-10 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/4289#issuecomment-73836885
  
Retest this please


[GitHub] spark pull request: [SPARK-5498][SQL]fix bug when query the data w...

2015-02-08 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/4289#issuecomment-73459718
  
Thanks @chenghao-intel for the review and suggestions! I took some of your 
advice to simplify the code.


[GitHub] spark pull request: [SPARK-5498][SQL]fix bug when query the data w...

2015-02-05 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/4289#issuecomment-73176093
  
Hi @chenghao-intel, @marmbrus, any suggestions?


[GitHub] spark pull request: [SPARK-5068] [SQL] Fix bug query data when pat...

2015-02-04 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/4356#discussion_r24067924
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -207,13 +219,21 @@ class HadoopTableReader(
    * If `filterOpt` is defined, then it will be used to filter files from `path`. These files are
    * returned in a single, comma-separated string.
    */
-  private def applyFilterIfNeeded(path: Path, filterOpt: Option[PathFilter]): String = {
-    filterOpt match {
-      case Some(filter) =>
-        val fs = path.getFileSystem(sc.hiveconf)
-        val filteredFiles = fs.listStatus(path, filter).map(_.getPath.toString)
-        filteredFiles.mkString(",")
-      case None => path.toString
+  private def applyFilterIfNeeded(path: Path, filterOpt: Option[PathFilter]): Option[String] = {
+    if (fs.exists(path)) {
--- End diff --

I think we'd better get `fs` from the `path`, because with HDFS namenode 
federation we may hit problems like a `Wrong FS` exception if we use 
`FileSystem.get(sc.hiveconf)` to get the fs.
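
A small standalone illustration of the difference (the `nameservice2` URI below is made up and assumes that namespace is configured): `FileSystem.get(conf)` resolves against `fs.defaultFS`, while `Path.getFileSystem(conf)` resolves against the URI of the path itself, which is what matters when partitions live on more than one namespace.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object FsResolutionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Illustrative partition path on a namespace other than fs.defaultFS.
    val path = new Path("hdfs://nameservice2/warehouse/db/table/part=1")

    // Resolves against fs.defaultFS; under namenode federation this can point
    // at a different namespace than the path, which later surfaces as a
    // "Wrong FS" error when this filesystem is asked to list that path.
    val defaultFs: FileSystem = FileSystem.get(conf)

    // Resolves against the path's own URI, so each partition gets the
    // filesystem it actually belongs to.
    val pathFs: FileSystem = path.getFileSystem(conf)

    println(s"default fs: ${defaultFs.getUri}, path fs: ${pathFs.getUri}")
  }
}
```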


[GitHub] spark pull request: [SPARK-5498][SQL]fix bug when query the data w...

2015-02-04 Thread jeanlyn
Github user jeanlyn commented on a diff in the pull request:

https://github.com/apache/spark/pull/4289#discussion_r24138028
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -315,9 +335,23 @@ private[hive] object HadoopTableReader extends HiveInspectors {
       }
     }
 
+    val partTblObjectInspectorConverter = ObjectInspectorConverters.getConverter(
+      deserializer.getObjectInspector, soi)
+
     // Map each tuple to a row object
     iterator.map { value =>
-      val raw = deserializer.deserialize(value)
+      val raw = convertdeserializer match {
--- End diff --

Thanks for the reminder.
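
For readers unfamiliar with the Hive converter API used in the diff above, a minimal standalone sketch; the object inspectors here are chosen purely for illustration, whereas the PR converts between a partition deserializer's inspector and the table-level inspector (`soi`):

```scala
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory
import org.apache.hadoop.io.Text

object ConverterSketch {
  def main(args: Array[String]): Unit = {
    // Build a converter from one inspector's representation to another's,
    // analogous to converting what a partition's SerDe produces into what
    // the table-level object inspector expects.
    val converter = ObjectInspectorConverters.getConverter(
      PrimitiveObjectInspectorFactory.writableStringObjectInspector,
      PrimitiveObjectInspectorFactory.javaStringObjectInspector)

    // A writable Text value, as a SerDe might emit, converted for the target inspector.
    println(converter.convert(new Text("2015-02-04")))
  }
}
```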

