speeding up rowcount

2011-10-09 Thread Rita
Hi, I have been doing a rowcount via MapReduce and it's taking about 4-5 hours to count 500 million rows in a table. I was wondering if there are any MapReduce tunings I can do so it will go much faster. I have a 10-node cluster, each node with 8 CPUs and 64GB of memory. Any tuning advice would

Re: speeding up rowcount

2011-10-09 Thread Ted Yu
I guess your hbase.hregion.max.filesize is quite high. If possible, lower its value so that you have smaller regions. On Sun, Oct 9, 2011 at 7:50 AM, Rita rmorgan...@gmail.com wrote: Hi, I have been doing a rowcount via MapReduce and it's taking about 4-5 hours to count 500 million rows in

Re: speeding up rowcount

2011-10-09 Thread Himanshu Vashishtha
Since a MapReduce job runs as a separate process, try a high Scan caching value. http://hbase.apache.org/book.html#perf.hbase.client.caching Himanshu On Sun, Oct 9, 2011 at 9:09 AM, Ted Yu yuzhih...@gmail.com wrote: I guess your hbase.hregion.max.filesize is quite high. If possible, lower its value

Re: speeding up rowcount

2011-10-09 Thread Rita
Thanks for the responses. Where do I set the high Scan caching value? On Sun, Oct 9, 2011 at 11:19 AM, Himanshu Vashishtha hvash...@cs.ualberta.ca wrote: Since a MapReduce job runs as a separate process, try a high Scan caching value. http://hbase.apache.org/book.html#perf.hbase.client.caching

Re: speeding up rowcount

2011-10-09 Thread Ted Yu
Excellent question. There seems to be a bug for RowCounter. In TableInputFormat:

    if (conf.get(SCAN_CACHEDROWS) != null) {
      scan.setCaching(Integer.parseInt(conf.get(SCAN_CACHEDROWS)));
    }

But I don't see SCAN_CACHEDROWS in either TableMapReduceUtil or RowCounter. Mind

Re: MiniDFSCluster configuration via spring

2011-10-09 Thread Eugene Koontz
On 10/8/11 5:31 AM, Eric Charles wrote: Sorry, fallback situation is https://svn.apache.org/repos/asf/james/mailbox/trunk/spring/src/main/resources/META-INF/org/apache/james/spring-mailbox-hbase.xml The link [1] in previous mail is what we want to achieve but we get the

Re: speeding up rowcount

2011-10-09 Thread Himanshu Vashishtha
Since a RowCounter uses FirstKeyOnlyFilter, we can have a default Scan cache value of 500 or so? Himanshu On Sun, Oct 9, 2011 at 9:44 AM, Ted Yu yuzhih...@gmail.com wrote: Excellent question. There seems to be a bug for RowCounter. In TableInputFormat: if (conf.get(SCAN_CACHEDROWS)

Re: speeding up rowcount

2011-10-09 Thread Ted Yu
That is fine. We should also allow users to override cache value. On Sun, Oct 9, 2011 at 9:26 AM, Himanshu Vashishtha hvash...@cs.ualberta.ca wrote: Since a RowCounter uses FirstKeyOnlyFilter, we can have a default Scan cache value of 500 or so? Himanshu On Sun, Oct 9, 2011 at 9:44 AM,
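The "default of 500, but let users override it" idea discussed above could look roughly like the sketch below. It is plain Java with no HBase dependency: `ScanCachingConfig`, `resolveCaching`, and `estimateFetches` are hypothetical names, `java.util.Properties` stands in for the Hadoop `Configuration`, and the key string is what `SCAN_CACHEDROWS` is assumed to hold, so check it against your HBase version. The resolved value would then be handed to `scan.setCaching(...)` as in the TableInputFormat snippet quoted in the thread.

```java
import java.util.Properties;

/** Sketch only: resolve a scanner caching value with a default and a user override. */
final class ScanCachingConfig {
    // Assumed key name; the thread only shows the constant SCAN_CACHEDROWS, not its value.
    static final String SCAN_CACHEDROWS = "hbase.mapreduce.scan.cachedrows";
    // Default suggested in the thread for RowCounter ("500 or so").
    static final int DEFAULT_CACHING = 500;

    /** Returns the user-supplied caching value if it is a positive integer, else the default. */
    static int resolveCaching(Properties conf) {
        String raw = conf.getProperty(SCAN_CACHEDROWS);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_CACHING;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value > 0 ? value : DEFAULT_CACHING;
        } catch (NumberFormatException e) {
            return DEFAULT_CACHING; // malformed override: fall back rather than fail the job
        }
    }

    /** Rough count of scanner fetch round trips: ceil(rows / caching). */
    static long estimateFetches(long rows, int caching) {
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        int caching = resolveCaching(conf);
        System.out.println("caching=" + caching
                + " fetches=" + estimateFetches(500_000_000L, caching));
    }
}
```

For the 500 million rows in the original question, fetching one row per round trip means on the order of 500 million fetches, while a caching value of 500 brings that down to about one million. The real RPC count will differ somewhat, so treat the helper as a back-of-the-envelope estimate.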

Re: Using Scans in parallel

2011-10-09 Thread Bryan Keller
I was not able to get consistent results using multiple scanners in parallel on a table. I implemented a counter test that used 8 scanners in parallel on a table with 2m rows with 2k+ columns each, and the results were not consistent. There were no errors thrown, but the count was off by as

Re: Using Scans in parallel

2011-10-09 Thread Bryan Keller
On further thought, it seems this might be a serious issue, as two unrelated processes within an application may be scanning the same table at the same time. On Oct 9, 2011, at 10:59 AM, Bryan Keller wrote: I was not able to get consistent results using multiple scanners in parallel on a

Re: Using Scans in parallel

2011-10-09 Thread lars hofhansl
Which version of HBase? Are there concurrent inserts? If so, do you see splits in the log files happening while you do the scanning? I am pretty sure this has nothing to do with concurrent scans. From: Bryan Keller brya...@gmail.com To: Bryan Keller

Re: Using Scans in parallel

2011-10-09 Thread Bryan Keller
This is just scanning (reads). I'll need to do more testing to find a cause, hopefully it is something with my test. On Oct 9, 2011, at 1:13 PM, lars hofhansl wrote: Which version of HBase? Are there concurrent inserts? If so, do you see splits in the log files happening while you do the

Re: Using Scans in parallel

2011-10-09 Thread lars hofhansl
How frequently does this happen? I did notice a while ago in the code that scanner ids are drawn just from a Random number generator. So in theory it would be possible that multiple concurrent scans draw the same scanner id. Since these are longs, this is astronomically unlikely, though
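To put a rough number on "astronomically unlikely", here is a small birthday-style estimate. It assumes scanner ids are independent and uniformly distributed over 2^bits values, which is an idealization: `java.util.Random` has a 48-bit seed and cannot produce every possible long, so the 48-bit figure is included as a more conservative bound. `ScannerIdCollision` and `collisionProbability` are hypothetical names used only for this sketch.

```java
/** Sketch only: approximate chance that any two of n concurrent scanners share an id. */
final class ScannerIdCollision {
    /**
     * Birthday approximation: P(collision) ~ n(n-1)/2 / 2^idBits.
     * Only meaningful when the result is much smaller than 1.
     */
    static double collisionProbability(long concurrentScanners, int idBits) {
        double n = (double) concurrentScanners;
        double pairs = n * (n - 1) / 2.0;
        return pairs / Math.pow(2.0, idBits);
    }

    public static void main(String[] args) {
        // 8 parallel scanners, as in the test described in this thread.
        System.out.println("64-bit ids: " + collisionProbability(8, 64));
        System.out.println("48-bit ids: " + collisionProbability(8, 48));
    }
}
```

Under either assumption the estimate for 8 scanners stays far too small to explain a result that is wrong on every run.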

Re: Spaces disappear in HBase?

2011-10-09 Thread Ben West
Thanks Andy! I do not see this in the wiki anywhere (http://wiki.apache.org/hadoop/Hbase/Stargate) - could we put it in? I'm not certain I know what exactly needs to be encoded: just values when you're inserting? How about the row names when you're scanning? (I've been having trouble with

Re: Using Scans in parallel

2011-10-09 Thread Himanshu Vashishtha
I don't think it will work without an exception in that case. These scanner ids are generated from the Random instance in HRegionServer. If two scanners got the same scannerId, wouldn't one of them get a LeaseStillHeldException in the addScanner method? Himanshu On Sun, Oct 9, 2011 at 3:53 PM, lars hofhansl

Re: MiniDFSCluster configuration via spring

2011-10-09 Thread Eric Charles
Hi Eugene, Thx for the link! I quickly traversed the jira and need to take more time to understand the work done so far on it. However, the default 1000 value is already a large one and I don't think I hit it (but who knows...). In general, the tests reuse the configuration from the

Re: Building HBase trunk on windows using cygwin

2011-10-09 Thread Suraj Varma
Just like trunk, it also fails for 0.92-SNAPSHOT on cygwin; a hack similar to the below was needed to get it built. I would certainly like the maven build to work cross-platform. I can open a Jira if there are no objections ... --Suraj On Wed, Oct 5, 2011 at 5:43 AM, Mayuresh

Re: Building HBase trunk on windows using cygwin

2011-10-09 Thread Stack
Please open a JIRA. A patch would be much appreciated (and some testimony that it works for you since few of us, I believe, dev on Windows). Thanks Suraj, St.Ack On Sun, Oct 9, 2011 at 4:39 PM, Suraj Varma svarma...@gmail.com wrote: Just like trunk, it also fails for 0.92-SNAPSHOT on cygwin; a

Re: speeding up rowcount

2011-10-09 Thread Ryan Rawson
Are you sure the job is running on the cluster and not running in single node mode? This happens a lot... On Oct 9, 2011 7:50 AM, Rita rmorgan...@gmail.com wrote: Hi, I have been doing a rowcount via MapReduce and it's taking about 4-5 hours to count 500 million rows in a table. I was

Re: speeding up rowcount

2011-10-09 Thread lars hofhansl
Be aware that the contract for a scan is to return all rows sorted by rowkey, hence it cannot scan regions in parallel by default. I have not played much with HBase and MapReduce, but if order is not important you can split the scan into multiple scans. - Original Message - From: Tom
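Splitting one ordered scan into several independent sub-scans, as suggested above, can be sketched without HBase at all. In the example below a sorted `NavigableSet<String>` stands in for the table, and each sub-range is counted on its own thread. Ranges are half-open, [start, stop), matching the start-inclusive and stop-exclusive convention of an HBase `Scan`, so no row is counted twice or skipped at a boundary. `ParallelRangeCount`, `countRange`, and `parallelCount` are hypothetical names; the split keys are assumed to be sorted ascending, much as region start keys would be.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch only: count rows by splitting a sorted key space into disjoint sub-ranges. */
final class ParallelRangeCount {
    /** Counts keys in [start, stop); a null bound means "unbounded" on that side. */
    static long countRange(NavigableSet<String> rows, String start, String stop) {
        NavigableSet<String> view = rows;
        if (start != null) {
            view = view.tailSet(start, true);  // start key is inclusive
        }
        if (stop != null) {
            view = view.headSet(stop, false);  // stop key is exclusive
        }
        long n = 0;
        for (String ignored : view) {
            n++;
        }
        return n;
    }

    /** Runs one sub-count per range defined by the sorted split keys and sums the results. */
    static long parallelCount(NavigableSet<String> rows, List<String> splitKeys, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (int i = 0; i <= splitKeys.size(); i++) {
                final String start = (i == 0) ? null : splitKeys.get(i - 1);
                final String stop = (i == splitKeys.size()) ? null : splitKeys.get(i);
                Callable<Long> task = () -> countRange(rows, start, stop);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        NavigableSet<String> rows = new TreeSet<>();
        for (int i = 0; i < 1000; i++) {
            rows.add(String.format("row%04d", i));
        }
        List<String> splits = List.of("row0250", "row0500", "row0750");
        System.out.println(parallelCount(rows, splits, 4));
    }
}
```

The results come back in no particular order across ranges, which is fine for counting but would not satisfy the sorted-by-rowkey contract of a single scan.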

Re: Using Scans in parallel

2011-10-09 Thread Bryan Keller
This is 100% reproducible for me, so I doubt it is related to random number generation. On Oct 9, 2011, at 2:53 PM, lars hofhansl wrote: How frequently does this happen? I did notice a while ago in the code that scanner ids are drawn just from a Random number generator. So in theory it

Re: speeding up rowcount

2011-10-09 Thread Himanshu Vashishtha
MapReduce support in HBase inherently provides parallelism such that each Region is given to one mapper. Himanshu On Sun, Oct 9, 2011 at 6:44 PM, lars hofhansl lhofha...@yahoo.com wrote: Be aware that the contract for a scan is to return all rows sorted by rowkey, hence it cannot scan regions

Re: Using Scans in parallel

2011-10-09 Thread Joe Pallas
On Oct 9, 2011, at 2:53 PM, lars hofhansl wrote: How frequently does this happen? I did notice a while ago in the code that scanner ids are drawn just from a Random number generator.
Really? That doesn’t seem like a good idea.
So in theory it would be possible that multiple concurrent

Re: Using Scans in parallel

2011-10-09 Thread Himanshu Vashishtha
Interesting. Hey Bryan, can you please share the stats about: how many Regions, how many Region Servers, time taken by Serial scanner and with 8 parallel scanners. Himanshu On Sun, Oct 9, 2011 at 6:49 PM, Bryan Keller brya...@gmail.com wrote: This is 100% reproducible for me, so I doubt it is

Split bug will cause data loss or cann't read write

2011-10-09 Thread BlueDavy Lin
Hi! Reproducing the problem is very easy; I added a comment to https://issues.apache.org/jira/browse/HBASE-3872. We changed the code as below to fix the fatal bug: if (!testing) { this.journal.add(JournalEntry.PONR);

Re: Split bug will cause data loss or cann't read write

2011-10-09 Thread Ted Yu
I followed up on HBASE-3872. Please open a new JIRA. Thanks On Sun, Oct 9, 2011 at 7:07 PM, BlueDavy Lin blued...@gmail.com wrote: Hi! Reproducing the problem is very easy; I added a comment to https://issues.apache.org/jira/browse/HBASE-3872. We changed the code as below to fix the fatal

Re: Using Scans in parallel

2011-10-09 Thread Bryan Keller
Sure. 2 region servers with 5 disks each. Table has 2 column families and 113 regions total for 2m rows. I'm scanning just one of the families. Performance with the 8 parallel scanners is 4x faster than the serial scanner (roughly 20 minutes vs 80 minutes). On Oct 9, 2011, at 7:00 PM, Himanshu Vashishtha

Re: Using Scans in parallel

2011-10-09 Thread Bryan Keller
BTW, a MapReduce job can scan the table in 6 minutes (both column families), including some processing. So that is the fastest approach. On Oct 9, 2011, at 8:03 PM, Bryan Keller wrote: Sure. 2 region servers with 5 disks each. Table has 2 column families and 113 regions total for 2m rows. I'm

basic question for newbie

2011-10-09 Thread Syg raf
Hello folks, I'm just starting with HBase and have a couple of rudimentary questions about how to use it: I have a simple Java program that was developed for the purpose of learning HBase. The program was built with MySQL as the database, and it contains 3 tables: Authors, Books and AuthorBook

Re: basic question for newbie

2011-10-09 Thread Sam Seigal
Start off with the HBase book, great resource for getting started: http://ofps.oreilly.com/titles/9781449396107/ On Sun, Oct 9, 2011 at 10:25 PM, Syg raf sygura2...@gmail.com wrote: Hello folks, I'm just starting with HBase and have a couple of rudimentary questions about how to use it:

Re: basic question for newbie

2011-10-09 Thread Syg raf
Thanks Sam for the link, I'm going to get the book ordered. Just to start and get familiar with HBase, can you tell me if everything goes into one table with the data separated into column families? If not, on what basis do we create multiple tables? Thanks On Mon, Oct 10, 2011 at 1:28 AM, Sam Seigal