[jira] [Commented] (HBASE-3721) Speedup LoadIncrementalHFiles

2011-05-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13030224#comment-13030224
 ] 

Hudson commented on HBASE-3721:
---

Integrated in HBase-TRUNK #1909 (See 
[https://builds.apache.org/hudson/job/HBase-TRUNK/1909/])


> Speedup LoadIncrementalHFiles
> -
>
> Key: HBASE-3721
> URL: https://issues.apache.org/jira/browse/HBASE-3721
> Project: HBase
>  Issue Type: Improvement
>  Components: util
>Reporter: Ted Yu
>Assignee: Ted Yu
> Fix For: 0.92.0
>
> Attachments: 3721-v2.txt, 3721-v3.txt, 3721-v4.txt, 3721-v6.patch, 
> 3721.txt, LoadIncrementalHFiles.java
>
>
> From Adam Phelps:
> from the logs it looks like <1% of the hfiles we're loading have to be split. 
>  Looking at the code for LoadIncrementalHFiles (hbase v0.90.1), I'm actually 
> thinking our problem is that this code loads the hfiles sequentially.  Our 
> largest table has over 2500 regions and the data being loaded is fairly well 
> distributed across them, so there end up being around 2500 HFiles for each 
> load period.  At 1-2 seconds per HFile that means the loading process is very 
> time consuming.
> Currently server.bulkLoadHFile() is a blocking call.
> We can use an ExecutorService to achieve better parallelism on multi-core 
> machines.
> New configuration parameter "hbase.loadincremental.threads.max" is introduced 
> which sets the maximum number of threads for parallel bulk load.
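
A minimal sketch of the idea in the description, not the code from the attached patches: each blocking per-file load is submitted to a bounded ExecutorService so that several run at once. The class ParallelLoadSketch, the method loadAll, and the blockingLoad callback are hypothetical names used only for illustration. Only the property name hbase.loadincremental.threads.max comes from the issue, and here its value is passed in as a plain int rather than read from an HBase Configuration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

class ParallelLoadSketch {
    /**
     * Runs one blocking load per HFile path on a fixed-size thread pool.
     * maxThreads plays the role of hbase.loadincremental.threads.max.
     * Returns the number of loads that completed without throwing.
     */
    static int loadAll(List<String> hfilePaths, int maxThreads,
                       Consumer<String> blockingLoad) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (String path : hfilePaths) {
                tasks.add(() -> {
                    // Stand-in for the blocking bulk load call (assumption).
                    blockingLoad.accept(path);
                    return null;
                });
            }
            int succeeded = 0;
            // invokeAll blocks until every submitted task has finished.
            for (Future<Void> result : pool.invokeAll(tasks)) {
                try {
                    result.get();
                    succeeded++;
                } catch (ExecutionException e) {
                    // The result type is Void, so failures are only logged.
                    System.err.println("load failed: " + e.getCause());
                }
            }
            return succeeded;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while loading", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

A caller would pass the list of HFile paths, the configured thread limit, and a callback wrapping the real load call. How much this helps in practice depends on how many concurrent loads the region servers can absorb.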

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3721) Speedup LoadIncrementalHFiles

2011-05-05 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029732#comment-13029732
 ] 

Ted Yu commented on HBASE-3721:
---

Version 6 is what Adam used.



[jira] [Commented] (HBASE-3721) Speedup LoadIncrementalHFiles

2011-05-05 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029728#comment-13029728
 ] 

stack commented on HBASE-3721:
--

@Ted v6 patch is what he tested?  Is that what I should commit?  Thanks.



[jira] [Commented] (HBASE-3721) Speedup LoadIncrementalHFiles

2011-05-05 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029694#comment-13029694
 ] 

Ted Yu commented on HBASE-3721:
---

From Adam:
I did a number of runs of loading a single set of HFiles with and without the 
patch, and it does seem the patch improves the load speed. I'll need to run 
more extensively to get accurate numbers, but with the patch I'm seeing ranges 
from 3-7 minutes vs 5-11 without the patch.



[jira] [Commented] (HBASE-3721) Speedup LoadIncrementalHFiles

2011-05-03 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028608#comment-13028608
 ] 

stack commented on HBASE-3721:
--

I asked him in private mail.  Thanks Ted.



[jira] [Commented] (HBASE-3721) Speedup LoadIncrementalHFiles

2011-05-03 Thread jirapos...@reviews.apache.org (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028591#comment-13028591
 ] 

jirapos...@reviews.apache.org commented on HBASE-3721:
--



bq.  On 2011-05-03 21:51:39, Michael Stack wrote:
bq.  > Does it work?  If it does, I'm good w/ applying it.  There are some 
questions in the below.  See what you think Ted.
bq.  
bq.  Ted Yu wrote:
bq.  I ran unit tests (TestHFileOutputFormat and TestLoadIncrementalHFiles) 
on my patch.
bq.  
bq.  Michael Stack wrote:
bq.  I was more asking if you'd loaded up any files with.  Do you think we 
should get someone like Adam Portley to try it?  Will I ask them?  They are the 
ones who wanted this originally complaining about slow load speed, is that 
right?  Good on you Ted.

I hope Adam would try out the latest patch.


- Ted


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/572/#review639
---


On 2011-05-03 22:28:11, Ted Yu wrote:
bq.  
bq.  ---
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/572/
bq.  ---
bq.  
bq.  (Updated 2011-05-03 22:28:11)
bq.  
bq.  
bq.  Review request for hbase and Todd Lipcon.
bq.  
bq.  
bq.  Summary
bq.  ---
bq.  
bq.  I refactored LoadIncrementalHFiles so that tryLoad() queues work items in 
List<ServerCallable<Void>>. doBulkLoad() periodically sends a batch of 
ServerCallables to the HBase cluster.
bq.  I added the following method to HConnection/HConnectionManager:
bq.  public <T> void getRegionServerWithRetries(ExecutorService pool,
bq.  List<ServerCallable<T>> callables, Object[] results)
bq.  This method uses a thread pool to send multiple ServerCallables through 
getRegionServerWithRetries(ServerCallable<T> callable).
bq.  
bq.  I introduced two new config parameters: hbase.loadincremental.threads.max 
and hbase.loadincremental.batch.size
bq.  hbase.loadincremental.batch.size is for configuring the batch size above 
which HConnection.getRegionServerWithRetries() would be called. In Adam's case, 
there're many small HFiles. LoadIncrementalHFiles shouldn't wait until all 
HFiles have been scanned.
bq.  hbase.loadincremental.threads.max controls the maximum number of threads 
in thread pool.
bq.  
bq.  
bq.  This addresses bug HBASE-3721.
bq.  https://issues.apache.org/jira/browse/HBASE-3721
bq.  
bq.  
bq.  Diffs
bq.  -
bq.  
bq.
/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java 
1099118 
bq.  
bq.  Diff: https://reviews.apache.org/r/572/diff
bq.  
bq.  
bq.  Testing
bq.  ---
bq.  
bq.  TestLoadIncrementalHFiles and TestHFileOutputFormat pass.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Ted
bq.  
bq.





[jira] [Commented] (HBASE-3721) Speedup LoadIncrementalHFiles

2011-05-03 Thread jirapos...@reviews.apache.org (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028587#comment-13028587
 ] 

jirapos...@reviews.apache.org commented on HBASE-3721:
--



bq.  On 2011-05-03 21:51:39, Michael Stack wrote:
bq.  > Does it work?  If it does, I'm good w/ applying it.  There are some 
questions in the below.  See what you think Ted.
bq.  
bq.  Ted Yu wrote:
bq.  I ran unit tests (TestHFileOutputFormat and TestLoadIncrementalHFiles) 
on my patch.

I was more asking if you'd loaded up any files with.  Do you think we should 
get someone like Adam Portley to try it?  Will I ask them?  They are the ones 
who wanted this originally complaining about slow load speed, is that right?  
Good on you Ted.


- Michael


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/572/#review639
---




[jira] [Commented] (HBASE-3721) Speedup LoadIncrementalHFiles

2011-05-03 Thread jirapos...@reviews.apache.org (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028586#comment-13028586
 ] 

jirapos...@reviews.apache.org commented on HBASE-3721:
--



bq.  On 2011-05-03 21:51:39, Michael Stack wrote:
bq.  > 
/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java, 
line 212
bq.  > 
bq.  >
bq.  > Nothing is done w/ the result here.  Should it be logged or 
something?
bq.  
bq.  Ted Yu wrote:
bq.  The return type is Void.
bq.  I do log errors.

OK.  Makes sense.


- Michael


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/572/#review639
---




[jira] [Commented] (HBASE-3721) Speedup LoadIncrementalHFiles

2011-05-03 Thread jirapos...@reviews.apache.org (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028479#comment-13028479
 ] 

jirapos...@reviews.apache.org commented on HBASE-3721:
--


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/572/
---

(Updated 2011-05-03 22:28:11.346203)


Review request for hbase and Todd Lipcon.


Changes
---

Cleaned up white spaces.


Summary
---

I refactored LoadIncrementalHFiles so that tryLoad() queues work items in 
List<ServerCallable<Void>>. doBulkLoad() periodically sends a batch of 
ServerCallables to the HBase cluster.
I added the following method to HConnection/HConnectionManager:
public <T> void getRegionServerWithRetries(ExecutorService pool,
List<ServerCallable<T>> callables, Object[] results)
This method uses a thread pool to send multiple ServerCallables through 
getRegionServerWithRetries(ServerCallable<T> callable).

I introduced two new config parameters: hbase.loadincremental.threads.max and 
hbase.loadincremental.batch.size
hbase.loadincremental.batch.size is for configuring the batch size above which 
HConnection.getRegionServerWithRetries() would be called. In Adam's case, 
there're many small HFiles. LoadIncrementalHFiles shouldn't wait until all 
HFiles have been scanned.
hbase.loadincremental.threads.max controls the maximum number of threads in 
thread pool.
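
The batching described in this summary can be sketched roughly as follows. This is an illustration under assumptions, not the patch itself: BatchingSubmitter, queue(), flush(), and failures() are hypothetical names, plain Callable<Void> stands in for ServerCallable, and the two constructor arguments stand in for hbase.loadincremental.batch.size and hbase.loadincremental.threads.max. Whether the patch sends a batch on reaching the threshold or only after exceeding it is not stated here; the sketch sends on reaching it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BatchingSubmitter implements AutoCloseable {
    private final int batchSize;
    private final ExecutorService pool;
    private final List<Callable<Void>> pending = new ArrayList<>();
    private int failures = 0;

    BatchingSubmitter(int batchSize, int maxThreads) {
        this.batchSize = batchSize;
        this.pool = Executors.newFixedThreadPool(maxThreads);
    }

    /** Queues one work item and sends the batch once it reaches batchSize. */
    void queue(Callable<Void> item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Runs all pending items on the pool and waits for them to finish. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<Callable<Void>> batch = new ArrayList<>(pending);
        pending.clear();
        try {
            for (Future<Void> result : pool.invokeAll(batch)) {
                try {
                    result.get();
                } catch (ExecutionException e) {
                    failures++;
                    System.err.println("work item failed: " + e.getCause());
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for batch", e);
        }
    }

    int failures() {
        return failures;
    }

    /** Sends whatever is still queued, then releases the pool's threads. */
    @Override
    public void close() {
        try {
            flush();
        } finally {
            pool.shutdown();
        }
    }
}
```

Sending in batches means work starts before every HFile has been scanned, which is the point made about many small HFiles. The queue itself is only touched by the calling thread, so it needs no locking in this sketch.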


This addresses bug HBASE-3721.
https://issues.apache.org/jira/browse/HBASE-3721


Diffs (updated)
-

  /src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java 
1099118 

Diff: https://reviews.apache.org/r/572/diff


Testing
---

TestLoadIncrementalHFiles and TestHFileOutputFormat pass.


Thanks,

Ted





[jira] [Commented] (HBASE-3721) Speedup LoadIncrementalHFiles

2011-05-03 Thread jirapos...@reviews.apache.org (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028475#comment-13028475
 ] 

jirapos...@reviews.apache.org commented on HBASE-3721:
--



bq.  On 2011-05-03 21:51:39, Michael Stack wrote:
bq.  > Does it work?  If it does, I'm good w/ applying it.  There are some 
questions in the below.  See what you think Ted.

I ran unit tests (TestHFileOutputFormat and TestLoadIncrementalHFiles) on my 
patch.


bq.  On 2011-05-03 21:51:39, Michael Stack wrote:
bq.  > 
/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java, 
line 212
bq.  > 
bq.  >
bq.  > Nothing is done w/ the result here.  Should it be logged or 
something?

The return type is Void.
I do log errors.


bq.  On 2011-05-03 21:51:39, Michael Stack wrote:
bq.  > 
/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java, 
line 233
bq.  > 
bq.  >
bq.  > There are a bunch of these in this patch... white space.

Will remove white spaces in next patch.


bq.  On 2011-05-03 21:51:39, Michael Stack wrote:
bq.  > 
/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java, 
line 235
bq.  > 
bq.  >
bq.  > Will multiple threads be trying to get a unique name at the same 
time?  Is this a good enough 'unique' name -- table name and incrementing 
number?  Is this per unique table-based name to isolate thread writes to the fs?

I changed regionCount to AtomicLong.
The unique name is to isolate writes to fs from different threads.
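
A minimal sketch of why the counter type matters, not the patch's code: with a plain long field, two threads can read the same value and build the same name, whereas AtomicLong.incrementAndGet() hands every caller a distinct number. UniqueNameSketch and nextName are hypothetical names, and the "-" separator is an assumption; the table-name-plus-incrementing-number scheme is taken from the review question quoted above.

```java
import java.util.concurrent.atomic.AtomicLong;

class UniqueNameSketch {
    private final String tableName;
    // Shared by all threads; incrementAndGet() is atomic, so no lock is needed.
    private final AtomicLong counter = new AtomicLong();

    UniqueNameSketch(String tableName) {
        this.tableName = tableName;
    }

    /** Returns a name that no other caller of this instance will receive. */
    String nextName() {
        return tableName + "-" + counter.incrementAndGet();
    }
}
```

Names are unique only within one instance in one JVM. Two separate load processes working on the same table would need something more, such as a per-process prefix.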


- Ted


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/572/#review639
---


On 2011-04-29 20:48:41, Ted Yu wrote:
bq.  
bq.  ---
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/572/
bq.  ---
bq.  
bq.  (Updated 2011-04-29 20:48:41)
bq.  
bq.  
bq.  Review request for hbase and Todd Lipcon.
bq.  
bq.  
bq.  Summary
bq.  ---
bq.  
bq.  I refactored LoadIncrementalHFiles so that tryLoad() queues work items in 
List<ServerCallable<Void>>. doBulkLoad() periodically sends a batch of 
ServerCallables to the HBase cluster.
bq.  I added the following method to HConnection/HConnectionManager:
bq.  public <T> void getRegionServerWithRetries(ExecutorService pool,
bq.  List<ServerCallable<T>> callables, Object[] results)
bq.  This method uses a thread pool to send multiple ServerCallables through 
getRegionServerWithRetries(ServerCallable<T> callable).
bq.  
bq.  I introduced two new config parameters: hbase.loadincremental.threads.max 
and hbase.loadincremental.batch.size
bq.  hbase.loadincremental.batch.size is for configuring the batch size above 
which HConnection.getRegionServerWithRetries() would be called. In Adam's case, 
there're many small HFiles. LoadIncrementalHFiles shouldn't wait until all 
HFiles have been scanned.
bq.  hbase.loadincremental.threads.max controls the maximum number of threads 
in thread pool.
bq.  
bq.  
bq.  This addresses bug HBASE-3721.
bq.  https://issues.apache.org/jira/browse/HBASE-3721
bq.  
bq.  
bq.  Diffs
bq.  -
bq.  
bq.
/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java 
1097897 
bq.  
bq.  Diff: https://reviews.apache.org/r/572/diff
bq.  
bq.  
bq.  Testing
bq.  ---
bq.  
bq.  TestLoadIncrementalHFiles and TestHFileOutputFormat pass.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Ted
bq.  
bq.




[jira] [Commented] (HBASE-3721) Speedup LoadIncrementalHFiles

2011-05-03 Thread jirapos...@reviews.apache.org (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028458#comment-13028458
 ] 

jirapos...@reviews.apache.org commented on HBASE-3721:
--


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/572/#review639
---


Does it work?  If it does, I'm good w/ applying it.  There are some questions 
in the below.  See what you think Ted.


/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java


Nothing is done w/ the result here.  Should it be logged or something?



/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java


There are a bunch of these in this patch... white space.



/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java


Will multiple threads be trying to get a unique name at the same time?  Is 
this a good enough 'unique' name -- table name and incrementing number?  Is 
this per unique table-based name to isolate thread writes to the fs?


- Michael




[jira] [Commented] (HBASE-3721) Speedup LoadIncrementalHFiles

2011-04-29 Thread jirapos...@reviews.apache.org (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027178#comment-13027178
 ] 

jirapos...@reviews.apache.org commented on HBASE-3721:
--


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/572/
---

(Updated 2011-04-29 20:48:41.082584)


Review request for hbase and Todd Lipcon.


Changes
---

Simplified the changes for this JIRA according to Todd's review.
TestLoadIncrementalHFiles and TestHFileOutputFormat pass.


Summary
---

I refactored LoadIncrementalHFiles so that tryLoad() queues work items in 
List<ServerCallable<Void>>. doBulkLoad() periodically sends a batch of 
ServerCallables to the HBase cluster.
I added the following method to HConnection/HConnectionManager:
public <T> void getRegionServerWithRetries(ExecutorService pool,
List<ServerCallable<T>> callables, Object[] results)
This method uses a thread pool to send multiple ServerCallables through 
getRegionServerWithRetries(ServerCallable<T> callable).

I introduced two new config parameters: hbase.loadincremental.threads.max and 
hbase.loadincremental.batch.size
hbase.loadincremental.batch.size is for configuring the batch size above which 
HConnection.getRegionServerWithRetries() would be called. In Adam's case, 
there're many small HFiles. LoadIncrementalHFiles shouldn't wait until all 
HFiles have been scanned.
hbase.loadincremental.threads.max controls the maximum number of threads in 
thread pool.
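
The batching described in the summary above can be sketched roughly as 
follows. This is not the code from the patch: a plain Callable<String> stands 
in for ServerCallable, and BatchedLoader, add(), flush() and finish() are 
hypothetical names. The two constructor arguments play the role of 
hbase.loadincremental.threads.max and hbase.loadincremental.batch.size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: Callable<String> stands in for HBase's ServerCallable, and all
// names in this class are hypothetical.
final class BatchedLoader {
    private final ExecutorService pool;
    private final int batchSize;
    private final List<Callable<String>> pending = new ArrayList<>();
    private final List<String> results = new ArrayList<>();

    BatchedLoader(int maxThreads, int batchSize) {
        this.pool = Executors.newFixedThreadPool(maxThreads);
        this.batchSize = batchSize;
    }

    /** Queue one work item; send the batch once it reaches batchSize. */
    void add(Callable<String> item) throws InterruptedException, ExecutionException {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Run all queued items on the pool and wait for the whole batch. */
    void flush() throws InterruptedException, ExecutionException {
        for (Future<String> f : pool.invokeAll(pending)) {
            results.add(f.get()); // a failed task surfaces as ExecutionException
        }
        pending.clear();
    }

    /** Send whatever is left, release the pool and return all results. */
    List<String> finish() throws InterruptedException, ExecutionException {
        flush();
        pool.shutdown();
        return results;
    }

    /** Small demo: "load" n fake files in batches and return the results. */
    static List<String> demo(int n, int maxThreads, int batchSize) {
        BatchedLoader loader = new BatchedLoader(maxThreads, batchSize);
        try {
            for (int i = 0; i < n; i++) {
                final int id = i;
                loader.add(() -> "hfile-" + id);
            }
            return loader.finish();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Note that each flush() blocks until the slowest item in the batch is done, so 
the scan of further HFiles is paused during that time.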


This addresses bug HBASE-3721.
https://issues.apache.org/jira/browse/HBASE-3721


Diffs (updated)
-

  /src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java 
1097897 

Diff: https://reviews.apache.org/r/572/diff


Testing
---

TestLoadIncrementalHFiles and TestHFileOutputFormat pass.


Thanks,

Ted



> Speedup LoadIncrementalHFiles
> -
>
> Key: HBASE-3721
> URL: https://issues.apache.org/jira/browse/HBASE-3721
> Project: HBase
>  Issue Type: Improvement
>  Components: util
>Reporter: Ted Yu
>Assignee: Ted Yu
> Attachments: 3721-v2.txt, 3721-v3.txt, 3721-v4.txt, 3721.txt
>
>
> From Adam Phelps:
> from the logs it looks like <1% of the hfiles we're loading have to be split. 
>  Looking at the code for LoadIncrementalHFiles (hbase v0.90.1), I'm actually 
> thinking our problem is that this code loads the hfiles sequentially.  Our 
> largest table has over 2500 regions and the data being loaded is fairly well 
> distributed across them, so there end up being around 2500 HFiles for each 
> load period.  At 1-2 seconds per HFile that means the loading process is very 
> time consuming.
> Currently server.bulkLoadHFile() is a blocking call.
> We can utilize ExecutorService to achieve better parallelism on multi-core 
> computers.



[jira] [Commented] (HBASE-3721) Speedup LoadIncrementalHFiles

2011-04-28 Thread jirapos...@reviews.apache.org (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026854#comment-13026854
 ] 

jirapos...@reviews.apache.org commented on HBASE-3721:
--


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/572/#review612
---



/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java


Another way is to keep submitting Callable's in doBulkLoad() and save each 
returned Future.
When there are no more Callable's to submit, we call Future.get() on each 
saved Future.
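
A minimal sketch of this submit-then-wait approach, using only 
java.util.concurrent. SubmitThenWait, runAll() and demo() are hypothetical 
names, and the squaring task merely stands in for a bulk load call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: all names are hypothetical and the tasks are placeholders.
final class SubmitThenWait {

    /** Submit every task immediately, then block on each saved Future in turn. */
    static <T> List<T> runAll(ExecutorService pool, List<Callable<T>> tasks)
            throws InterruptedException, ExecutionException {
        List<Future<T>> futures = new ArrayList<>();
        for (Callable<T> task : tasks) {
            futures.add(pool.submit(task)); // returns at once; work runs on the pool
        }
        List<T> results = new ArrayList<>();
        for (Future<T> f : futures) {
            results.add(f.get()); // waits until this particular task is done
        }
        return results;
    }

    /** Demo: square the numbers 0..n-1 on a small pool. */
    static List<Integer> demo(int n) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                final int id = i;
                tasks.add(() -> id * id);
            }
            return runAll(pool, tasks);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Results come back in submission order because the Futures are read in the 
order they were saved, regardless of which task finishes first.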


- Ted




[jira] [Commented] (HBASE-3721) Speedup LoadIncrementalHFiles

2011-04-28 Thread jirapos...@reviews.apache.org (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026849#comment-13026849
 ] 

jirapos...@reviews.apache.org commented on HBASE-3721:
--


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/572/#review611
---



/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java


Your suggestion is good.
The existing usages of ExecutorService in SplitTransaction and 
HConnectionManager submit all Callable's and then wait for them to complete.
I will try to use some data structure, such as a Map, so that the producer 
and the consumer work concurrently.


- Ted




[jira] [Commented] (HBASE-3721) Speedup LoadIncrementalHFiles

2011-04-27 Thread jirapos...@reviews.apache.org (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026146#comment-13026146
 ] 

jirapos...@reviews.apache.org commented on HBASE-3721:
--


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/572/#review597
---



/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java


I don't necessarily see the point of this being in HConnectionManager, 
since it's just a general submission of a bunch of callables.

If HConnectionManager did something smarter so that threads sleeping for 
retry didn't lock up a thread to sleep, it would make more sense here.



/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java


Since these arrays are only filled in when there are exceptions, we can 
expect them to be very small in general, and the extra code to track 
actionCount isn't likely to make any difference. It's just a 
micro-optimization for something that isn't a bottleneck.



/src/main/java/org/apache/hadoop/hbase/client/HTable.java


Why is this public?



/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java


Rather than working in batches, why not just have one thread which submits 
tasks to the executor, and another thread which pulls off completed ones? 
I.e., I don't see the point of a batch size at all, as opposed to something 
more "fluid": a normal producer/consumer kind of design.
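
One way to get such a producer/consumer arrangement from the JDK alone is 
ExecutorCompletionService, sketched below. This is an illustration, not what 
the patch does: FluidLoader, loadAll() and demo() are hypothetical names, the 
task body is a placeholder for the bulk load call, and the sketch assumes the 
number of files is known up front so the consumer knows when to stop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only: all names are hypothetical and the task body is a placeholder.
final class FluidLoader {

    static List<String> loadAll(List<String> hfiles, int maxThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        CompletionService<String> done = new ExecutorCompletionService<>(pool);

        // Producer: submits one task per file without waiting for any result.
        Thread producer = new Thread(() -> {
            for (String hfile : hfiles) {
                done.submit(() -> "loaded " + hfile);
            }
        });
        producer.start();

        try {
            // Consumer: pulls tasks off as they complete, in completion order.
            List<String> results = new ArrayList<>();
            for (int i = 0; i < hfiles.size(); i++) {
                results.add(done.take().get());
            }
            return results;
        } finally {
            producer.join();
            pool.shutdown();
        }
    }

    /** Demo wrapper that converts checked exceptions for convenience. */
    static List<String> demo(List<String> hfiles, int maxThreads) {
        try {
            return loadAll(hfiles, maxThreads);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this design there is no batch size to tune: the pool size alone bounds 
how many loads are in flight, and results arrive in completion order rather 
than submission order.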



/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java


I don't like this dependency. Check out Guava's ThreadFactoryBuilder.


- Todd



[jira] [Commented] (HBASE-3721) Speedup LoadIncrementalHFiles

2011-04-09 Thread jirapos...@reviews.apache.org (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017867#comment-13017867
 ] 

jirapos...@reviews.apache.org commented on HBASE-3721:
--


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/572/
---

Review request for hbase and Todd Lipcon.


Summary
---

I refactored LoadIncrementalHFiles so that tryLoad() queues work items in a 
List of ServerCallable's. doBulkLoad() periodically sends a batch of 
ServerCallable's to the HBase cluster.
I added the following method to HConnection/HConnectionManager:
public <T> void getRegionServerWithRetries(ExecutorService pool,
List<ServerCallable<T>> callables, Object[] results)
This method uses a thread pool to send multiple ServerCallable's through 
getRegionServerWithRetries(ServerCallable callable).

I introduced two new config parameters: hbase.loadincremental.threads.max and 
hbase.loadincremental.batch.size.
hbase.loadincremental.batch.size configures the batch size above which 
HConnection.getRegionServerWithRetries() is called. In Adam's case there are 
many small HFiles, so LoadIncrementalHFiles shouldn't wait until all HFiles 
have been scanned.
hbase.loadincremental.threads.max controls the maximum number of threads in 
the thread pool.
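
For illustration, the two parameters could be read and applied along these 
lines. This is a sketch under stated assumptions: java.util.Properties stands 
in for the HBase Configuration object, LoadSettings is a hypothetical class, 
and the default values of 10 and 100 are invented here, not taken from the 
patch.

```java
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only: Properties stands in for the HBase Configuration object, and
// both default values below are hypothetical.
final class LoadSettings {
    static final String THREADS_MAX = "hbase.loadincremental.threads.max";
    static final String BATCH_SIZE = "hbase.loadincremental.batch.size";

    final int maxThreads;
    final int batchSize;

    LoadSettings(Properties conf) {
        this.maxThreads = positiveInt(conf, THREADS_MAX, 10);  // assumed default
        this.batchSize = positiveInt(conf, BATCH_SIZE, 100);   // assumed default
    }

    /** Read a positive integer setting, falling back when the key is absent. */
    private static int positiveInt(Properties conf, String key, int fallback) {
        String raw = conf.getProperty(key);
        if (raw == null) {
            return fallback;
        }
        int value = Integer.parseInt(raw.trim());
        if (value < 1) {
            throw new IllegalArgumentException(key + " must be >= 1, got " + value);
        }
        return value;
    }

    /** Thread pool bounded by the threads.max setting. */
    ExecutorService newPool() {
        return Executors.newFixedThreadPool(maxThreads);
    }
}
```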


This addresses bug HBASE-3721.
https://issues.apache.org/jira/browse/HBASE-3721


Diffs
-

  /src/main/java/org/apache/hadoop/hbase/client/HConnection.java 1090500 
  /src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java 1090500 
  /src/main/java/org/apache/hadoop/hbase/client/HTable.java 1090500 
  /src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java 
1090500 

Diff: https://reviews.apache.org/r/572/diff


Testing
---

TestLoadIncrementalHFiles and TestHFileOutputFormat pass.


Thanks,

Ted



> Speedup LoadIncrementalHFiles
> -
>
> Key: HBASE-3721
> URL: https://issues.apache.org/jira/browse/HBASE-3721
> Project: HBase
>  Issue Type: Improvement
>  Components: util
>Reporter: Ted Yu
>Assignee: Ted Yu
> Attachments: 3721-v2.txt, 3721-v3.txt, 3721-v4.txt, 3721.txt
>
>
> From Adam Phelps:
> from the logs it looks like <1% of the hfiles we're loading have to be split. 
>  Looking at the code for LoadIncrementalHFiles (hbase v0.90.1), I'm actually 
> thinking our problem is that this code loads the hfiles sequentially.  Our 
> largest table has over 2500 regions and the data being loaded is fairly well 
> distributed across them, so there end up being around 2500 HFiles for each 
> load period.  At 1-2 seconds per HFile that means the loading process is very 
> time consuming.
> Currently server.bulkLoadHFile() is a blocking call.
> We can utilize ExecutorService to achieve better parallelism.



[jira] [Commented] (HBASE-3721) Speedup LoadIncrementalHFiles

2011-04-02 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015041#comment-13015041
 ] 

Ted Yu commented on HBASE-3721:
---

LoadIncrementalHFiles may split a StoreFile. The above proposal only works if 
there is no such splitting.

> Speedup LoadIncrementalHFiles
> -
>
> Key: HBASE-3721
> URL: https://issues.apache.org/jira/browse/HBASE-3721
> Project: HBase
>  Issue Type: Improvement
>  Components: util
>Reporter: Ted Yu
>
> From Adam Phelps:
> from the logs it looks like <1% of the hfiles we're loading have to be split. 
>  Looking at the code for LoadIncrementalHFiles (hbase v0.90.1), I'm actually 
> thinking our problem is that this code loads the hfiles sequentially.  Our 
> largest table has over 2500 regions and the data being loaded is fairly well 
> distributed across them, so there end up being around 2500 HFiles for each 
> load period.  At 1-2 seconds per HFile that means the loading process is very 
> time consuming.
> Currently server.bulkLoadHFile() is a blocking call.
> We can utilize ExecutorService to achieve better parallelism.



[jira] [Commented] (HBASE-3721) Speedup LoadIncrementalHFiles

2011-04-02 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015035#comment-13015035
 ] 

Ted Yu commented on HBASE-3721:
---

To achieve parallelism, we need to add a new method to HConnection:
{code}
  public <T> T getRegionServerWithRetries(ExecutorService pool,
      List<ServerCallable<T>> callables, Object[] results)
      throws IOException, RuntimeException;
{code}
Its implementation would use wrapper objects for callables which keep track of 
the number of retries carried out so far.

LoadIncrementalHFiles would compose ServerCallables and pass them to the above 
method.
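
A wrapper of the kind described above might look roughly like this. It is a 
sketch, not the proposed HConnection code: RetryingCallable is a hypothetical 
name, a plain java.util.concurrent.Callable stands in for ServerCallable, and 
there is no pause between attempts, which a real client would presumably want.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Sketch only: RetryingCallable is a hypothetical name and Callable stands in
// for HBase's ServerCallable. No back-off between attempts is modelled.
final class RetryingCallable<T> implements Callable<T> {
    private final Callable<T> delegate;
    private final int maxRetries;
    private int retries; // number of retries carried out so far

    RetryingCallable(Callable<T> delegate, int maxRetries) {
        this.delegate = delegate;
        this.maxRetries = maxRetries;
    }

    int retriesSoFar() {
        return retries;
    }

    @Override
    public T call() throws Exception {
        while (true) {
            try {
                return delegate.call();
            } catch (InterruptedException e) {
                throw e; // never retry an interrupt
            } catch (Exception e) {
                if (retries >= maxRetries) {
                    throw e; // retry budget exhausted: surface the last failure
                }
                retries++;
            }
        }
    }

    /**
     * Demo: a task that fails a given number of times before succeeding.
     * Returns the retries used, or -1 if the wrapper gave up.
     */
    static int demo(int failuresBeforeSuccess, int maxRetries) {
        int[] calls = {0};
        RetryingCallable<String> wrapped = new RetryingCallable<String>(() -> {
            if (calls[0]++ < failuresBeforeSuccess) {
                throw new IOException("transient failure");
            }
            return "ok";
        }, maxRetries);
        try {
            wrapped.call();
            return wrapped.retriesSoFar();
        } catch (Exception e) {
            return -1;
        }
    }
}
```

Because the wrapper is itself a Callable, it can be handed to an 
ExecutorService unchanged, which is what makes the per-callable retry count 
usable from a thread pool.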

> Speedup LoadIncrementalHFiles
> -
>
> Key: HBASE-3721
> URL: https://issues.apache.org/jira/browse/HBASE-3721
> Project: HBase
>  Issue Type: Improvement
>  Components: util
>Reporter: Ted Yu
>
> From Adam Phelps:
> from the logs it looks like <1% of the hfiles we're loading have to be split. 
>  Looking at the code for LoadIncrementalHFiles (hbase v0.90.1), I'm actually 
> thinking our problem is that this code loads the hfiles sequentially.  Our 
> largest table has over 2500 regions and the data being loaded is fairly well 
> distributed across them, so there end up being around 2500 HFiles for each 
> load period.  At 1-2 seconds per HFile that means the loading process is very 
> time consuming.
> Currently server.bulkLoadHFile() is a blocking call.
> We can utilize ExecutorService to achieve better parallelism.
