[jira] [Commented] (HBASE-5140) TableInputFormat subclass to allow N number of splits per region during MR jobs

2012-01-06 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181536#comment-13181536
 ] 

Zhihong Yu commented on HBASE-5140:
---

We should consider the amount of computation involved in the map/reduce tasks.
The assumption in the description (that keys are evenly distributed) may not 
hold in many scenarios.

I think we can provide abstraction over key space partitioning by introducing 
an interface.
The BigDecimal idea would be one implementation.
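
A minimal sketch of what such an interface might look like. The names KeySpaceSplitter and BigIntegerSplitter are hypothetical and not part of HBase, and BigInteger stands in for the BigDecimal mentioned in the description, since only whole-number arithmetic is needed. Keys are read as unsigned big-endian numbers, the shorter key is zero-padded on the right, and both region bounds are assumed to be non-empty.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical abstraction over key space partitioning (not an HBase API). */
interface KeySpaceSplitter {
    /** Returns the n-1 interior boundary keys that cut [start, end) into n ranges. */
    List<byte[]> split(byte[] start, byte[] end, int n);
}

/** One possible implementation: treat keys as unsigned big-endian integers. */
final class BigIntegerSplitter implements KeySpaceSplitter {
    @Override
    public List<byte[]> split(byte[] start, byte[] end, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        // Pad both keys to the same width so numeric order matches byte order.
        int width = Math.max(start.length, end.length);
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, width));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(end, width));
        if (lo.compareTo(hi) >= 0) {
            throw new IllegalArgumentException("start must sort before end");
        }
        BigInteger range = hi.subtract(lo);
        List<byte[]> boundaries = new ArrayList<>();
        for (int i = 1; i < n; i++) {
            BigInteger offset = range.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(n));
            boundaries.add(toFixedWidth(lo.add(offset), width));
        }
        return boundaries;
    }

    /** Converts back to exactly width bytes, dropping any sign byte and left-padding with zeros. */
    private static byte[] toFixedWidth(BigInteger value, int width) {
        byte[] raw = value.toByteArray();
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }
}
```

If the key range is smaller than N, adjacent boundaries can coincide, so a caller would probably want to drop duplicates before building splits.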

> TableInputFormat subclass to allow N number of splits per region during MR 
> jobs
> ---
>
> Key: HBASE-5140
> URL: https://issues.apache.org/jira/browse/HBASE-5140
> Project: HBase
>  Issue Type: New Feature
>  Components: mapreduce
>Reporter: Josh Wymer
>Priority: Trivial
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> With regard to [HBASE-5138|https://issues.apache.org/jira/browse/HBASE-5138], I 
> am working on a subclass of the TableInputFormat class that overrides 
> getSplits in order to generate N splits per region and/or N splits per job. 
> The idea is to convert the startKey and endKey for each 
> region from byte[] to BigDecimal, take the difference, divide by N, convert 
> back to byte[] and generate splits on the resulting values. Assuming your 
> keys are evenly distributed, this should generate splits with nearly the same 
> number of rows per split. Any suggestions on this issue are welcome.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5140) TableInputFormat subclass to allow N number of splits per region during MR jobs

2012-01-06 Thread Josh Wymer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181550#comment-13181550
 ] 

Josh Wymer commented on HBASE-5140:
---

We also talked about other methods such as using the first 8 bytes of the keys 
and converting to a long. This could indeed be solved by an interface.
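
A rough sketch of the 8-byte variant, under the assumption that keys shorter than 8 bytes are zero-padded on the right and longer keys are truncated. The class and method names are made up for illustration. Because row keys sort as unsigned bytes while Java's long is signed, comparisons go through Long.compareUnsigned.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical helper: maps a key prefix to a long and back. */
final class LongPrefixSketch {
    /** First 8 bytes of the key as a big-endian long (padded or truncated to 8 bytes). */
    static long toLong(byte[] key) {
        return ByteBuffer.wrap(Arrays.copyOf(key, Long.BYTES)).getLong();
    }

    /** The 8-byte big-endian key for a long value. */
    static byte[] toKey(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    /** Orders two prefixes the way unsigned byte comparison orders the keys. */
    static int compareKeys(long a, long b) {
        return Long.compareUnsigned(a, b);
    }
}
```

A plain signed comparison would place any key whose first byte is 0x80 or higher before keys starting with smaller bytes, which is the opposite of the byte order.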




[jira] [Commented] (HBASE-5140) TableInputFormat subclass to allow N number of splits per region during MR jobs

2012-01-06 Thread Josh Wymer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181565#comment-13181565
 ] 

Josh Wymer commented on HBASE-5140:
---

One glaring issue is the lack of start and end keys for single-region tables. To 
get the start key we could do a quick scan of the first row and take its key. For 
the last region of a table, I'm not sure how to determine the end key other 
than setting it to the maximum value of whatever data type (e.g. long) we 
are using for the split calculations. Any suggestions other than this?




[jira] [Commented] (HBASE-5140) TableInputFormat subclass to allow N number of splits per region during MR jobs

2012-01-06 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181572#comment-13181572
 ] 

Zhihong Yu commented on HBASE-5140:
---

I suggest utilizing this method in HTable:
{code}
  public Pair<byte[][],byte[][]> getStartEndKeys() throws IOException {
{code}
i.e. start and end keys are passed to the splitter interface.




[jira] [Commented] (HBASE-5140) TableInputFormat subclass to allow N number of splits per region during MR jobs

2012-01-06 Thread Josh Wymer (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181579#comment-13181579
 ] 

Josh Wymer commented on HBASE-5140:
---

Correct, but on a table with one region, for example, getStartEndKeys() returns 
two empty byte[]. The last (or only) region of the table returns an 
empty byte[] as the end key, which lets the scan run to the end of the table. 
Therefore, we don't know the upper bound byte[] to use in order to determine 
the long (or int, etc.) value we want to use for split calculations. So we must 
either have an efficient way to get the last key in this case or arbitrarily 
set the long to its max value (since in any case nothing could be higher) and 
use that number to make the calculations. This obviously won't work for unbounded 
data types like BigDecimal and is a partial solution at best.
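
A sketch of the "use the maximum" fallback, with hypothetical names throughout. It assumes the first 8 bytes of the key are used and that an empty end key is mapped to all bits set. Note that if "max value" is read as Long.MAX_VALUE (0x7fff...), any key whose first byte is 0x80 or higher would sort above that bound under unsigned byte ordering, so this sketch uses -1L (0xffff...) and unsigned arithmetic instead.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical helper for regions whose end key is empty. */
final class OpenEndedRegionSketch {
    /** First 8 bytes of the key as a big-endian long; an empty key maps to 0. */
    static long prefix(byte[] key) {
        return ByteBuffer.wrap(Arrays.copyOf(key, Long.BYTES)).getLong();
    }

    /** An empty end key means "to the end of the table": use the unsigned maximum. */
    static long upperBound(byte[] endKey) {
        return endKey.length == 0 ? -1L : prefix(endKey);
    }

    /** The n-1 interior boundaries between lo and hi, both treated as unsigned. */
    static long[] boundaries(long lo, long hi, int n) {
        long step = Long.divideUnsigned(hi - lo, n);
        long[] out = new long[n - 1];
        for (int i = 1; i < n; i++) {
            out[i - 1] = lo + step * i;
        }
        return out;
    }
}
```

For a single-region table both keys are empty, so the bounds become 0 and the unsigned maximum. The splits are then only as even as the keys are spread across the whole 8-byte range, which matches the "partial solution" caveat above.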




[jira] [Commented] (HBASE-5140) TableInputFormat subclass to allow N number of splits per region during MR jobs

2012-01-06 Thread Ming Ma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181582#comment-13181582
 ] 

Ming Ma commented on HBASE-5140:


Is it the same as HBASE-4063?




[jira] [Commented] (HBASE-5140) TableInputFormat subclass to allow N number of splits per region during MR jobs

2012-01-06 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181604#comment-13181604
 ] 

Zhihong Yu commented on HBASE-5140:
---

MAPREDUCE-1220, referenced in HBASE-4063, has been resolved against Hadoop 0.23, 
so we cannot use it at the moment.

@Josh:
I believe the single region scenario is the degenerate case.
Using max value for long should be fine for that case.
The best practice is to presplit when creating the table.




[jira] [Commented] (HBASE-5140) TableInputFormat subclass to allow N number of splits per region during MR jobs

2012-01-09 Thread Jean-Daniel Cryans (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182876#comment-13182876
 ] 

Jean-Daniel Cryans commented on HBASE-5140:
---

bq. The best practice is to presplit when creating the table.

I think this jira is valid for cases where the regions are so big (GBs) that 
one would benefit from having multiple mappers per region. 




[jira] [Commented] (HBASE-5140) TableInputFormat subclass to allow N number of splits per region during MR jobs

2012-01-09 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182930#comment-13182930
 ] 

Hadoop QA commented on HBASE-5140:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12509974/Added_functionality_to_split_n_times_per_region_on_mapreduce_jobs.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

-1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/711//console

This message is automatically generated.

> TableInputFormat subclass to allow N number of splits per region during MR 
> jobs
> ---
>
> Key: HBASE-5140
> URL: https://issues.apache.org/jira/browse/HBASE-5140
> Project: HBase
>  Issue Type: New Feature
>  Components: mapreduce
>Affects Versions: 0.90.4
>Reporter: Josh Wymer
>Priority: Trivial
>  Labels: mapreduce, split
> Fix For: 0.90.4
>
> Attachments: 
> Added_functionality_to_split_n_times_per_region_on_mapreduce_jobs.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h




[jira] [Commented] (HBASE-5140) TableInputFormat subclass to allow N number of splits per region during MR jobs

2012-01-09 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183087#comment-13183087
 ] 

Hadoop QA commented on HBASE-5140:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12510006/Added_functionality_to_TableInputFormat_that_allows_splitting_of_regions.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated -151 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 80 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed these unit tests:
  org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
  org.apache.hadoop.hbase.mapreduce.TestTableMapReduce

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/712//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/712//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/712//console

This message is automatically generated.

> TableInputFormat subclass to allow N number of splits per region during MR 
> jobs
> ---
>
> Key: HBASE-5140
> URL: https://issues.apache.org/jira/browse/HBASE-5140
> Project: HBase
>  Issue Type: New Feature
>  Components: mapreduce
>Affects Versions: 0.90.4
>Reporter: Josh Wymer
>Priority: Trivial
>  Labels: mapreduce, split
> Fix For: 0.90.4
>
> Attachments: 
> Added_functionality_to_TableInputFormat_that_allows_splitting_of_regions.patch,
>  Added_functionality_to_split_n_times_per_region_on_mapreduce_jobs.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h




[jira] [Commented] (HBASE-5140) TableInputFormat subclass to allow N number of splits per region during MR jobs

2012-01-09 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183119#comment-13183119
 ] 

Hadoop QA commented on HBASE-5140:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12510012/Added_functionality_to_TableInputFormat_that_allows_splitting_of_regions.patch.1
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated -151 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 80 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed these unit tests:
  org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/713//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/713//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/713//console

This message is automatically generated.

> TableInputFormat subclass to allow N number of splits per region during MR 
> jobs
> ---
>
> Key: HBASE-5140
> URL: https://issues.apache.org/jira/browse/HBASE-5140
> Project: HBase
>  Issue Type: New Feature
>  Components: mapreduce
>Affects Versions: 0.90.4
>Reporter: Josh Wymer
>Priority: Trivial
>  Labels: mapreduce, split
> Fix For: 0.90.4
>
> Attachments: 
> Added_functionality_to_TableInputFormat_that_allows_splitting_of_regions.patch,
>  
> Added_functionality_to_TableInputFormat_that_allows_splitting_of_regions.patch.1,
>  Added_functionality_to_split_n_times_per_region_on_mapreduce_jobs.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h




[jira] [Commented] (HBASE-5140) TableInputFormat subclass to allow N number of splits per region during MR jobs

2012-03-01 Thread Rajesh Balamohan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220725#comment-13220725
 ] 

Rajesh Balamohan commented on HBASE-5140:
-

@Josh - Thanks for this patch.

The for loop within getSplits() generates the splits with the help of 
generateRegionSplits(). However, the returned list is not added 
back to "List<InputSplit> splits = new 
ArrayList<InputSplit>(keys.getFirst().length);", so the generated splits 
never reach the result.
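
The patch is not quoted in this thread, so the following is only an illustration of the pattern being described, with strings standing in for input splits. The generateRegionSplits() signature here is an assumption, not the one in the patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: strings stand in for regions and input splits. */
final class CollectSplitsSketch {
    /** Stand-in for the patch's generateRegionSplits(); the real signature is not shown in the thread. */
    static List<String> generateRegionSplits(String region, int n) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            parts.add(region + "#" + i);
        }
        return parts;
    }

    static List<String> getSplits(List<String> regions, int n) {
        List<String> splits = new ArrayList<>(regions.size() * n);
        for (String region : regions) {
            // Append the helper's result; calling it and discarding the
            // returned list would leave splits empty.
            splits.addAll(generateRegionSplits(region, n));
        }
        return splits;
    }
}
```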



> TableInputFormat subclass to allow N number of splits per region during MR 
> jobs
> ---
>
> Key: HBASE-5140
> URL: https://issues.apache.org/jira/browse/HBASE-5140
> Project: HBase
>  Issue Type: New Feature
>  Components: mapreduce
>Affects Versions: 0.90.4
>Reporter: Josh Wymer
>Priority: Trivial
>  Labels: mapreduce, split
> Fix For: 0.90.4
>
> Attachments: 
> Added_functionality_to_TableInputFormat_that_allows_splitting_of_regions.patch,
>  
> Added_functionality_to_TableInputFormat_that_allows_splitting_of_regions.patch.1,
>  Added_functionality_to_split_n_times_per_region_on_mapreduce_jobs.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> With regard to [HBASE-5138|https://issues.apache.org/jira/browse/HBASE-5138], I 
> am working on a patch for the TableInputFormat class that overrides getSplits 
> in order to generate N splits per region and/or N splits per job. 
> The idea is to convert the startKey and endKey for each region from 
> byte[] to BigDecimal, take the difference, divide by N, convert back to 
> byte[] and generate splits on the resulting values. Assuming your keys are 
> evenly distributed, this should generate splits with nearly the same number of 
> rows per split. Any suggestions on this issue are welcome.




[jira] [Commented] (HBASE-5140) TableInputFormat subclass to allow N number of splits per region during MR jobs

2014-06-10 Thread David Koch (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026369#comment-14026369
 ] 

David Koch commented on HBASE-5140:
---

{quote}
Stale issue. Reopen if still relevant.
{quote}

Why is this deemed irrelevant? Is there new functionality in recent HBase 
versions which supersedes this class? By the way, in method 
{{getMaxByteArrayValue}} the array value assignment should read:

{code}
bytes[i] = (byte) 0xff;
{code}
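
Since the patch is not quoted here, this is only a sketch of a helper with the name mentioned above; the int length parameter is an assumption. In Java the literal 0xff is the int 255, which cannot be assigned to a byte without a cast, and (byte) 0xff is -1, the bit pattern that sorts last when keys are compared as unsigned bytes.

```java
/** Sketch of a helper that builds the largest key of a given length. */
final class MaxKeySketch {
    /** Hypothetical signature: returns length bytes, each set to 0xff. */
    static byte[] getMaxByteArrayValue(int length) {
        byte[] bytes = new byte[length];
        for (int i = 0; i < length; i++) {
            // Without the cast this line does not compile: 255 is out of byte range.
            bytes[i] = (byte) 0xff;
        }
        return bytes;
    }
}
```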

> TableInputFormat subclass to allow N number of splits per region during MR 
> jobs
> ---
>
> Key: HBASE-5140
> URL: https://issues.apache.org/jira/browse/HBASE-5140
> Project: HBase
>  Issue Type: New Feature
>  Components: mapreduce
>Affects Versions: 0.90.4
>Reporter: Josh Wymer
>Priority: Trivial
>  Labels: mapreduce, split
> Attachments: 
> Added_functionality_to_TableInputFormat_that_allows_splitting_of_regions.patch,
>  
> Added_functionality_to_TableInputFormat_that_allows_splitting_of_regions.patch.1,
>  Added_functionality_to_split_n_times_per_region_on_mapreduce_jobs.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-5140) TableInputFormat subclass to allow N number of splits per region during MR jobs

2014-06-11 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028806#comment-14028806
 ] 

Andrew Purtell commented on HBASE-5140:
---

bq. Why is this deemed irrelevant?

No activity since March 2012. 
