Re: About HBase Files

2009-09-22 Thread stack
On Tue, Sep 22, 2009 at 10:10 PM, stchu wrote:
> Hi Stack and Erik,
>
> Thanks for your answers. I think the timestamp is also contained in mapfiles
> (in binary format?), am I right?

Yes, it's a serialized long.

> Hfile looks better. I will migrate my prog. to hadoop 0.20 and hbase 0.20
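A minimal sketch of what "a serialized long" means on the wire, using the client-side Bytes utility (the demo class itself is illustrative, not from the thread):

    import org.apache.hadoop.hbase.util.Bytes;

    public class TimestampRoundTrip {
        public static void main(String[] args) {
            long ts = System.currentTimeMillis();           // cell timestamp, ms since epoch
            byte[] onDisk = Bytes.toBytes(ts);              // 8 bytes, big-endian
            System.out.println(Bytes.toLong(onDisk) == ts); // true
        }
    }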

Re: About HBase Files

2009-09-22 Thread stchu
Hi Stack and Erik,

Thanks for your answers. I think the timestamp is also contained in mapfiles (in binary format?), am I right?

Hfile looks better. I will migrate my prog. to hadoop 0.20 and hbase 0.20 after I finish my experiments in 0.19. But it needs some effort for those incompatible APIs..
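The incompatible APIs are mostly the client classes: 0.19's BatchUpdate was replaced by Put/Get/Result in 0.20. A rough before-and-after sketch, reusing the row and column from stchu's table below; each half compiles against its own client jar, and method shapes are from memory of the two releases, so treat as illustrative:

    // Against the 0.19 client jar:
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.io.BatchUpdate;
    import org.apache.hadoop.hbase.util.Bytes;

    public class Write019 {
        static void write(HTable table) throws Exception {
            BatchUpdate bu = new BatchUpdate("120_25");
            bu.put("Level0:trail_id", Bytes.toBytes("3;21234")); // "family:qualifier" as one string
            table.commit(bu);
        }
    }

    // Against the 0.20 client jar, the equivalent write:
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class Write020 {
        static void write(HTable table) throws Exception {
            Put p = new Put(Bytes.toBytes("120_25"));
            p.add(Bytes.toBytes("Level0"), Bytes.toBytes("trail_id"), Bytes.toBytes("3;21234"));
            table.put(p);
        }
    }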

looking for better search suggestion

2009-09-22 Thread 梁景明
hi, i am looking for a better search suggestion. here is my data:

    id  book  author
    1   abc   me
    2   def   me
    3   ghi   you

and here is my hbase table data:

    1   id:1 = 1    book:abc = abc    author:me = me
    2   id:2 = 2    book:def = def    author:me = me
    3   id:3 = 3    book:ghi = ghi    author:you = you

when i wa

Re: Hbase and linear scaling with small write intensive clusters

2009-09-22 Thread stack
(Funny, I read the 2MB as 2GB -- yeah, why so small Guy?)

On Tue, Sep 22, 2009 at 4:59 PM, Jonathan Gray wrote:
> Is there a reason you have the split size set to 2MB? That's rather small
> and you'll end up constantly splitting, even once you have good
> distribution.
>
> I'd go for pre-splitt

Re: Hbase and linear scaling with small write intensive clusters

2009-09-22 Thread Jonathan Gray
Is there a reason you have the split size set to 2MB? That's rather small and you'll end up constantly splitting, even once you have good distribution.

I'd go for pre-splitting, as others suggest, but with larger region sizes.

Ryan Rawson wrote:
> An interesting thing about HBase is it really

Re: Hbase and linear scaling with small write intensive clusters

2009-09-22 Thread Ryan Rawson
An interesting thing about HBase is it really performs better with more data. Pre-splitting tables is one way. Another performance bottleneck is the write-ahead log. You can disable it by calling:

    Put.setWriteToWAL(false);

and you will achieve a substantial speedup. Good luck!

-ryan

On Tue, Sep
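A minimal sketch of Ryan's suggestion against the 0.20 client (table, family and qualifier names are placeholders; the durability trade-off is real: unflushed edits are lost if a region server dies):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UnloggedPut {
        public static void main(String[] args) throws Exception {
            HTable table = new HTable(new HBaseConfiguration(), "mytable");
            Put p = new Put(Bytes.toBytes("row1"));
            p.add(Bytes.toBytes("fam"), Bytes.toBytes("qual"), Bytes.toBytes("value"));
            p.setWriteToWAL(false);  // skip the write-ahead log for this edit:
                                     // big write speedup, but the edit is lost
                                     // if the region server crashes before a flush
            table.put(p);
            table.flushCommits();    // no-op with default auto-flush, kept for clarity
        }
    }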

Re: Hbase and linear scaling with small write intensive clusters

2009-09-22 Thread stack
Split your table in advance? You can do it from the UI or shell (script it?).

Regarding the same performance for 10 nodes as for 5 nodes: how many regions are in your table? What happens if you pile on more data? The split algorithm will be sped up in coming versions for sure. Two minutes seems like a lo
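A sketch of pre-splitting at table-creation time. Note that the split-keys overload of createTable shown here arrived in client releases later than the 0.20 discussed in this thread (on 0.20 itself you would split from the UI or shell after creation); the table and family names are placeholders, and real split points must match your actual key distribution:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
        public static void main(String[] args) throws Exception {
            HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());

            HTableDescriptor desc = new HTableDescriptor("mytable");
            desc.addFamily(new HColumnDescriptor("fam"));
            desc.setMaxFileSize(256L * 1024 * 1024);  // larger regions, less split churn

            // Nine split points give ten regions up front, so writes spread
            // across region servers from the first insert.
            byte[][] splits = new byte[9][];
            for (int i = 1; i <= 9; i++) {
                splits[i - 1] = Bytes.toBytes("row" + i);  // placeholder keys
            }
            admin.createTable(desc, splits);
        }
    }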

Hbase and linear scaling with small write intensive clusters

2009-09-22 Thread Molinari, Guy
Hello all,

I've been working with HBase for the past few months on a proof of concept/technology adoption evaluation. I wanted to describe my scenario to the user/development community to get some input on my observations. I've written an application that is comprised of two tables

Re: Performance Benchmarks

2009-09-22 Thread Bradford Stephens
Greetings Jon,

A quick performance snapshot: I believe with our cluster of 18 nodes (8 cores, 8 GB RAM, 2 x 500 GB drives per node), we were inserting rows of about 5-10 KB at a rate of 180,000/second. That's on a completely untuned cluster. You could see much better performance with proper twea

Re: Best-practice/design query for storing a list/array of values

2009-09-22 Thread Ryan Rawson
Serializing even a large list into 1 column is not a bad thing necessarily. The thing is, when you update that column you have to rewrite the whole thing. If you expect lots of items and frequent updates, it might be better to store each item in a column, as stack says above. Another question you c
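A small sketch of the rewrite cost Ryan is pointing at, using plain Java serialization for the one-column blob (the serialization scheme and class names are illustrative, not anything HBase prescribes):

    import java.io.*;
    import java.util.*;

    public class ListBlob {
        // Whole-list-in-one-cell: appending one item means rewriting it all.
        static byte[] serialize(List<String> items) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(new ArrayList<String>(items));
            oos.close();
            return bos.toByteArray();
        }

        @SuppressWarnings("unchecked")
        static List<String> deserialize(byte[] blob) throws IOException, ClassNotFoundException {
            return (List<String>) new ObjectInputStream(new ByteArrayInputStream(blob)).readObject();
        }

        public static void main(String[] args) throws Exception {
            byte[] cellValue = serialize(Arrays.asList("a", "b"));
            // To append "c": fetch the cell, rebuild, rewrite the whole value.
            List<String> items = deserialize(cellValue);
            items.add("c");
            cellValue = serialize(items);               // entire list rewritten for one new item
            System.out.println(deserialize(cellValue)); // [a, b, c]
        }
    }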

Re: Best-practice/design query for storing a list/array of values

2009-09-22 Thread stack
Would a family devoted to your list -- called 'list'! -- work for you? You could get individual members of the list by doing list:membername or get them all by getting all elements of the family, etc.

St.Ack

On Tue, Sep 22, 2009 at 9:57 AM, Keith Thomas wrote:
> I have a family which contains
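A sketch of St.Ack's family-as-list layout with the 0.20 client: one qualifier per member, fetch a single member via list:membername, or the whole list by asking for the family (table, row and member names are placeholders):

    import java.util.Map;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ListFamily {
        public static void main(String[] args) throws Exception {
            HTable table = new HTable(new HBaseConfiguration(), "mytable");
            byte[] row = Bytes.toBytes("row1");
            byte[] list = Bytes.toBytes("list");

            // One column per member: adding/removing a member touches only that cell.
            Put p = new Put(row);
            p.add(list, Bytes.toBytes("membername"), Bytes.toBytes("somevalue"));
            table.put(p);

            // Fetch the whole list by asking for the family.
            Get g = new Get(row);
            g.addFamily(list);
            Result r = table.get(g);
            for (Map.Entry<byte[], byte[]> e : r.getFamilyMap(list).entrySet()) {
                System.out.println(Bytes.toString(e.getKey()) + " = " + Bytes.toString(e.getValue()));
            }
        }
    }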

Performance Benchmarks

2009-09-22 Thread Jonathan Holloway
Hi all,

I was looking at the HBase Goes Realtime presentation yesterday and came across these numbers:

Tall Table: 1 million rows with a single column
* Insert - 0.24 ms per row
* Read - 1.42 ms per row
* Full Scan - 11 seconds

Wide Table: 1000 rows with 20,000 columns
* Insert - 312 ms per row
* R

Best-practice/design query for storing a list/array of values

2009-09-22 Thread Keith Thomas
I have a family which contains an array, or list, of values. I don't mind whether it is an array or a list or even an arraylist :) At the moment I have gone down the quick and dirty route of serializing my list into one column. While functionally this works sufficiently well to allow me to keep

Re: About HBase Files

2009-09-22 Thread stack
Yes, what Erik said. MapFile is a binary format. What you are seeing is some preamble up front listing the key and value class types plus some miscellaneous metadata. Then, per key and value, these are serialized Writable types.

Move to hbase 0.20.0. It uses hfile instead of mapfile. There is a nice l
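For the curious, a sketch of dumping one of those store MapFiles with Hadoop's MapFile.Reader. The HStoreKey/ImmutableBytesWritable key and value classes are an assumption based on the 0.19 store layout, and the path below is made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hbase.HStoreKey;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.io.MapFile;

    public class DumpStoreMapFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Hypothetical store file directory for one region/family.
            String dir = "/hbase/mytable/1234567890/Level0/mapfiles/5678901234";
            MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
            HStoreKey key = new HStoreKey();
            ImmutableBytesWritable value = new ImmutableBytesWritable();
            while (reader.next(key, value)) {  // each entry is a serialized Writable pair
                System.out.println(key + " => " + value);
            }
            reader.close();
        }
    }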

Re: About HBase Files

2009-09-22 Thread Erik Holstad
Hey Stchu!

Not exactly sure what the "messy code" is, except that it looks like non-printable binary data. Depending on where you look, I think it is values, offsets etc. The reason that we are keeping the family stored in the files is to leave the door open for something called locality groups. The

About HBase Files

2009-09-22 Thread stchu
Hi,

I use Hadoop 0.19.1 and HBase 0.19.3. I wrote a simple table which has 2 column families (Level0:trail_id, Level1:trail_id). And I put the data (4 rows) into the hbase table:

    120_25       column=Level0:trail_id, timestamp=2009091613240001, value=3;21234
    121.1_23.4