Hi,
I have been doing a rowcount via MapReduce and it's taking about 4-5 hours to
count 500 million rows in a table. I was wondering if there are any MapReduce
tunings I can do so it will go much faster.
I have a 10-node cluster, each node with 8 CPUs and 64 GB of memory. Any tuning
advice would be appreciated.
I guess your hbase.hregion.max.filesize is quite high.
If possible, lower its value so that you have smaller regions.
On Sun, Oct 9, 2011 at 7:50 AM, Rita rmorgan...@gmail.com wrote:
Hi,
I have been doing a rowcount via MapReduce and it's taking about 4-5 hours to
count 500 million rows in
Since MapReduce runs as a separate process, try a high Scan cache value.
http://hbase.apache.org/book.html#perf.hbase.client.caching
Himanshu
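For reference, one place that caching knob can be set is client-side configuration (a sketch; the property name is from the HBase client, and 500 is only an illustrative value echoing this thread, not a recommendation). Per job, the same effect can be had by calling Scan.setCaching(int) on the Scan handed to TableMapReduceUtil.initTableMapperJob.

```xml
<!-- hbase-site.xml on the client: rows fetched per scanner RPC.
     The old default of 1 makes full-table scans extremely slow. -->
<property>
  <name>hbase.client.scanner.caching</name>
  <value>500</value>
</property>
```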
On Sun, Oct 9, 2011 at 9:09 AM, Ted Yu yuzhih...@gmail.com wrote:
I guess your hbase.hregion.max.filesize is quite high.
If possible, lower its value
Thanks for the responses.
Where do I set the high Scan cache values?
On Sun, Oct 9, 2011 at 11:19 AM, Himanshu Vashishtha
hvash...@cs.ualberta.ca wrote:
Since MapReduce runs as a separate process, try a high Scan cache value.
http://hbase.apache.org/book.html#perf.hbase.client.caching
Excellent question.
There seems to be a bug in RowCounter.
In TableInputFormat:

    if (conf.get(SCAN_CACHEDROWS) != null) {
      scan.setCaching(Integer.parseInt(conf.get(SCAN_CACHEDROWS)));
    }
But I don't see SCAN_CACHEDROWS in either TableMapReduceUtil or RowCounter.
Mind
On 10/8/11 5:31 AM, Eric Charles wrote:
Sorry, fallback situation is
https://svn.apache.org/repos/asf/james/mailbox/trunk/spring/src/main/resources/META-INF/org/apache/james/spring-mailbox-hbase.xml
The link [1] in the previous mail is what we want to achieve but we get
the
Since RowCounter uses a FirstKeyOnlyFilter, could we have a default Scan
cache value of 500 or so?
Himanshu
On Sun, Oct 9, 2011 at 9:44 AM, Ted Yu yuzhih...@gmail.com wrote:
Excellent question.
There seems to be a bug for RowCounter.
In TableInputFormat:
if (conf.get(SCAN_CACHEDROWS)
That is fine.
We should also allow users to override the cache value.
On Sun, Oct 9, 2011 at 9:26 AM, Himanshu Vashishtha hvash...@cs.ualberta.ca
wrote:
Since a RowCounter uses FirstKeyOnlyFilter, we can have a default Scan
cache value of 500 or so?
Himanshu
On Sun, Oct 9, 2011 at 9:44 AM,
I was not able to get consistent results using multiple scanners in parallel on
a table. I implemented a counter test that used 8 scanners in parallel on a
table of 2M rows with 2K+ columns each, and the results were not consistent.
There were no errors thrown, but the count was off by as
On further thought, it seems this might be a serious issue, as two unrelated
processes within an application may be scanning the same table at the same time.
On Oct 9, 2011, at 10:59 AM, Bryan Keller wrote:
I was not able to get consistent results using multiple scanners in parallel
on a
Which version of HBase?
Are there concurrent inserts? If so, do you see splits in the log files
happening while you do the scanning?
I am pretty sure this has nothing to do with concurrent scans.
From: Bryan Keller brya...@gmail.com
To: Bryan Keller
This is just scanning (reads). I'll need to do more testing to find a cause,
hopefully it is something with my test.
On Oct 9, 2011, at 1:13 PM, lars hofhansl wrote:
Which version of HBase?
Are there concurrent inserts? If so, do you see splits in the log files
happening while you do the
How frequently does this happen?
I did notice a while ago in the code that scanner ids are drawn just from a
Random number generator.
So in theory it would be possible that multiple concurrent scans draw the same
scanner id.
Since these are longs, this is astronomically unlikely, though
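The "astronomically unlikely" claim is easy to sanity-check with a birthday-bound estimate. This is plain Java of my own, not HBase code, and it assumes the ids are uniform random 64-bit longs with n scanners open at once:

```java
// Birthday-bound estimate for random 64-bit scanner-id collisions.
// P(at least one collision among n ids) is roughly n*(n-1)/2 / 2^64.
public class ScannerIdCollision {
    static double collisionProbability(long n) {
        double pairs = (double) n * (n - 1) / 2.0; // number of id pairs
        return pairs / Math.pow(2.0, 64);          // 2^64 possible ids
    }

    public static void main(String[] args) {
        // Even with 10,000 concurrent scanners the estimate is tiny.
        System.out.println(collisionProbability(10_000));
    }
}
```

For 10,000 concurrent scanners the estimate is on the order of 1e-12, which supports the point that a collision is not a plausible cause of a 100%-reproducible miscount.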
Thanks Andy!
I do not see this in the wiki anywhere
(http://wiki.apache.org/hadoop/Hbase/Stargate) - could we put it in? I'm not
certain I know what exactly needs to be encoded: just values when you're
inserting? How about the row names when you're scanning? (I've been having
trouble with
I don't think it will work without an exception in that case. These
scanner ids are generated from the Random instance of HRegionServer.
If there were a duplicate scannerId, wouldn't one scanner get a
LeaseStillHeldException in the addScanner method?
Himanshu
On Sun, Oct 9, 2011 at 3:53 PM, lars hofhansl
Hi Eugene,
Thanks for the link! I quickly went through the jira and need to take more
time to understand the work done on it so far.
However, the default 1000 value is already a large one and I don't think
I hit it (but who knows...).
In general, the tests reuse the configuration from the
Just like trunk, 0.92-SNAPSHOT also fails to build on cygwin; a hack
similar to the one below was needed to get it built.
I would certainly like the maven build to work cross-platform.
I can open a Jira if there are no objections ...
--Suraj
On Wed, Oct 5, 2011 at 5:43 AM, Mayuresh
Please open a JIRA. A patch would be much appreciated (and some
testimony that it works for you, since few of us, I believe, dev on
Windows).
Thanks Suraj,
St.Ack
On Sun, Oct 9, 2011 at 4:39 PM, Suraj Varma svarma...@gmail.com wrote:
Just like trunk, it also fails for 0.92-SNAPSHOT on cygwin; a
Are you sure the job is running on the cluster and not in single-node
(local) mode? This happens a lot...
On Oct 9, 2011 7:50 AM, Rita rmorgan...@gmail.com wrote:
Hi,
I have been doing a rowcount via MapReduce and it's taking about 4-5 hours to
count 500 million rows in a table. I was
Be aware that the contract for a scan is to return all rows sorted by rowkey,
hence it cannot scan regions in parallel by default. I have not played much
with HBase and MapReduce, but if order is not important you can split the scan
into multiple scans.
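The "split the scan into multiple scans" idea can be sketched in plain Java. The long row keys and the helper name here are my own illustration, not an HBase API; real HBase row keys are byte[] and would be split on key prefixes or region boundaries instead:

```java
import java.util.ArrayList;
import java.util.List;

// Carve one scan's [startRow, stopRow) key space into n contiguous
// sub-ranges, one per parallel scanner.
public class ScanSplitter {
    static List<long[]> split(long startRow, long stopRow, int n) {
        List<long[]> ranges = new ArrayList<>();
        long span = stopRow - startRow;
        for (int i = 0; i < n; i++) {
            long s = startRow + span * i / n;       // inclusive start
            long e = startRow + span * (i + 1) / n; // exclusive stop
            ranges.add(new long[] {s, e});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 8 sub-scans covering 2M row keys, as in the test described above.
        for (long[] r : split(0L, 2_000_000L, 8)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

The sub-ranges are contiguous and together cover the original range, so each sub-scan can run independently as long as row order across scanners does not matter.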
- Original Message -
From: Tom
This is 100% reproducible for me, so I doubt it is related to random number
generation.
On Oct 9, 2011, at 2:53 PM, lars hofhansl wrote:
How frequently does this happen?
I did notice a while ago in the code that scanner ids are drawn just from a
Random number generator.
So in theory it
MapReduce support in HBase inherently provides parallelism, in that
each region is given to one mapper.
Himanshu
On Sun, Oct 9, 2011 at 6:44 PM, lars hofhansl lhofha...@yahoo.com wrote:
Be aware that the contract for a scan is to return all rows sorted by rowkey,
hence it cannot scan regions
On Oct 9, 2011, at 2:53 PM, lars hofhansl wrote:
How frequently does this happen?
I did notice a while ago in the code that scanner ids are drawn just from a
Random number generator.
Really? That doesn’t seem like a good idea.
So in theory it would be possible that multiple concurrent
Interesting.
Hey Bryan, can you please share some stats: how many regions, how
many region servers, and the time taken by the serial scanner vs the 8
parallel scanners?
Himanshu
On Sun, Oct 9, 2011 at 6:49 PM, Bryan Keller brya...@gmail.com wrote:
This is 100% reproducible for me, so I doubt it is
hi!
Reproducing the problem is very easy; I added a comment to
https://issues.apache.org/jira/browse/HBASE-3872.
We changed the code as below to fix the fatal bug:

    if (!testing) {
      this.journal.add(JournalEntry.PONR);
I followed up on HBASE-3872.
Please open a new JIRA.
Thanks
On Sun, Oct 9, 2011 at 7:07 PM, BlueDavy Lin blued...@gmail.com wrote:
hi!
Reproducing the problem is very easy; I added a comment to
https://issues.apache.org/jira/browse/HBASE-3872.
We changed the code as below to fix the fatal
Sure. 2 region servers with 5 disks each. The table has 2 column families and 113
regions total for 2M rows. I'm scanning just one of the families. Performance
with the 8 parallel scanners is 4x faster than with the serial scanner (roughly
20 minutes vs 80 minutes).
On Oct 9, 2011, at 7:00 PM, Himanshu Vashishtha
BTW, a MapReduce job can scan the table in 6 minutes (both column families),
including some processing. So that is the fastest approach.
On Oct 9, 2011, at 8:03 PM, Bryan Keller wrote:
Sure. 2 region servers with 5 disks each. Table has 2 column families and 113
regions total for 2m rows. I'm
Hello folks,
I'm just starting with HBase and have a couple of rudimentary questions
about how to use it:
I have a simple Java program that was developed for the purpose of
learning HBase. The program was built with MySQL as the database; it contains 3
tables: Authors, Books and AuthorBook
Start off with the HBase book; it's a great resource for getting started:
http://ofps.oreilly.com/titles/9781449396107/
On Sun, Oct 9, 2011 at 10:25 PM, Syg raf sygura2...@gmail.com wrote:
Hello folks,
I'm just starting with HBase and have a couple of rudimentary questions
about how to use it:
Thanks Sam for the link, I'm going to get the book ordered.
Just to start and get familiar with HBase, can you tell me if everything goes
into one table with the data separated into column families? If not, based on
what do we create multiple tables?
Thanks
On Mon, Oct 10, 2011 at 1:28 AM, Sam Seigal
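For what it's worth, a minimal HBase shell sketch of one possible layout (all table, family, and key names here are made up for illustration; HBase has no joins, so the author-book relation is typically denormalized into the row rather than kept in a join table):

```
create 'author', 'info'
create 'book', 'info', 'authors'
put 'book', 'b1', 'info:title', 'Some Title'
put 'book', 'b1', 'authors:a1', ''   # a1 is the row key of an author row
```

Whether entities share one table or get their own depends mostly on access patterns: rows read together should live in the same table, and column families should group columns with similar read/write behavior.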