Data taking up too much space when put into HBase

2010-11-09 Thread Hari Sreekumar
Hi, Data seems to be taking up too much space when I put it into HBase. For example, I have a 2 GB text file that takes up ~70 GB when I load it into HBase. I have the block size set to 64 MB and replication=3, which I think is the likely reason for this expansion. But if that is the case, how

Re: Data taking up too much space when put into HBase

2010-11-09 Thread Jean-Daniel Cryans
Each value is stored with its full key, e.g. row key + family + qualifier + timestamp + offsets. You don't give any information about how you stored the data, but if your keys are large enough, that would easily explain the bloat. J-D On Tue, Nov 9, 2010 at 9:21 PM, Hari Sreekumar wrote:
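To make the per-cell overhead concrete, here is a rough Python sketch of how large one stored cell is. It assumes the KeyValue layout HBase used at the time (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type). The function names and the sample row key, family and qualifier are made up for illustration, and HFile block and index overhead is ignored.

```python
# Fixed bytes per cell, assuming the classic KeyValue layout:
# key length (4) + value length (4) + row length (2)
# + family length (1) + timestamp (8) + key type (1)
FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1


def keyvalue_size(row: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate serialized size of one cell, in bytes."""
    return FIXED_OVERHEAD + len(row) + len(family) + len(qualifier) + len(value)


def bloat_factor(row: bytes, family: bytes, qualifier: bytes, value: bytes) -> float:
    """Stored bytes divided by the bytes of the value alone."""
    return keyvalue_size(row, family, qualifier, value) / len(value)


if __name__ == "__main__":
    # Hypothetical cell: a 32-byte row key and a 4-byte value.
    row = b"user12345678-2010-11-09T21:21:00"
    print(keyvalue_size(row, b"details", b"referrer_url", b"1234"))
    print(bloat_factor(row, b"details", b"referrer_url", b"1234"))
```

With short values the key can be several times larger than the value itself, and the row key is repeated in every cell of the row.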

Re: Data taking up too much space when put into HBase

2010-11-09 Thread Hari Sreekumar
Ah, so the bloat is not because of the files being 5-6 MB in size? Wouldn't a 6 MB file occupy 64 MB if I set the block size to 64 MB? hari On Wed, Nov 10, 2010 at 11:16 AM, Jean-Daniel Cryans wrote: > Each value is stored with its full key e.g. row key + family + > qualifier + timestamp + offsets.

Re: Data taking up too much space when put into HBase

2010-11-09 Thread Jean-Daniel Cryans
I'm pretty sure that's not how it's reported by the "du" command, but I wouldn't expect to see files of 5MB on average. Can you be more specific? J-D On Tue, Nov 9, 2010 at 9:58 PM, Hari Sreekumar wrote: > Ah, so the bloat is not because of the files being 5-6 MB in size? Wouldn't > a 6 MB file
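On the block-size question: as far as I know, an HDFS block only occupies as much disk as the data actually written to it, so the block size is an upper bound per block rather than an allocation unit. A small sketch of that arithmetic, ignoring checksum files and NameNode metadata (the function names are illustrative, not part of any Hadoop API):

```python
import math

MB = 1024 * 1024


def hdfs_block_count(file_size: int, block_size: int = 64 * MB) -> int:
    """Number of HDFS blocks a file of file_size bytes is split into."""
    return math.ceil(file_size / block_size)


def hdfs_raw_usage(file_size: int, replication: int = 3) -> int:
    """Approximate raw disk bytes: each replica stores only the actual data."""
    return file_size * replication


if __name__ == "__main__":
    six_mb = 6 * MB
    print(hdfs_block_count(six_mb), hdfs_raw_usage(six_mb) // MB)
```

Under this model a 6 MB store file with replication=3 accounts for roughly 18 MB of raw disk, not 3 x 64 MB, and replication alone would turn 2 GB into about 6 GB rather than ~70 GB.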

Re: Data taking up too much space when put into HBase

2010-11-09 Thread Hari Sreekumar
I checked the "browse filesystem" link in the web interface (50070). HBase creates a directory named after the table, and in that directory there are files which are 5-6 MB in size, on average. Some are a few KB, and there are some of 12-13 MB, but most are around 6 MB. I was thinking these file

Re: Data taking up too much space when put into HBase

2010-11-10 Thread Jean-Daniel Cryans
Can you pastebin the output of the lsr command on the table's dir? Thx J-D On Tue, Nov 9, 2010 at 10:54 PM, Hari Sreekumar wrote: > I checked the "browse filesystem" link in the web interface (50070). HBase > creates a directory named after the table, and in the directory, there are > files whic

Re: Data taking up too much space when put into HBase

2010-11-11 Thread Hari Sreekumar
Here's the output of lsr on one of the tables:
drwxr-xr-x - hadoop supergroup 0 2010-11-11 13:33 /hbase/Webevent/1102232448
-rw-r--r-- 3 hadoop supergroup 2318 2010-11-11 13:33 /hbase/Webevent/1102232448/.regioninfo
drwxr-xr-x - hadoop supergroup 0 2010-11-11 13:33 /h

Re: Data taking up too much space when put into HBase

2010-11-11 Thread Jean-Daniel Cryans
Oh I see, you are using 4 families. An important thing to know (and it's not super obvious) is that the regions flush on the total size of the memstore across all families (there's one memstore per family, learn more here http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html). This
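A toy model of the flush behaviour J-D describes, assuming a 64 MB region-wide flush threshold. The threshold value (configured through hbase.hregion.memstore.flush.size), the even split of writes between families, and the family names are all assumptions for illustration. It only shows that each flush writes one file per family and that those files shrink as the number of families grows:

```python
MB = 1024 * 1024


def flushed_file_sizes(family_shares: dict, flush_threshold: int = 64 * MB) -> dict:
    """Memstore bytes each family holds when the region-wide total hits the threshold.

    family_shares maps family name -> relative share of incoming writes.
    """
    total = sum(family_shares.values())
    return {
        family: int(flush_threshold * share / total)
        for family, share in family_shares.items()
    }


if __name__ == "__main__":
    one = flushed_file_sizes({"f": 1.0})
    four = flushed_file_sizes({"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25})
    print({name: size // MB for name, size in one.items()})
    print({name: size // MB for name, size in four.items()})
```

In this model four equally written families each flush about a quarter of the threshold. The files on disk can come out smaller still than these in-memory figures, since memstore accounting also counts per-entry heap overhead.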

Re: Data taking up too much space when put into HBase

2010-11-11 Thread Hari Sreekumar
Ah, that's a great piece of info J-D! I had 4 families just as a logical division. I don't think I'm really using the fact that we have 4 different families anywhere. Thanks a lot for the information. thanks, hari On Thu, Nov 11, 2010 at 10:45 PM, Jean-Daniel Cryans wrote: > Oh I see, you are us

Re: Data taking up too much space when put into HBase

2010-11-11 Thread Jeff Whiting
Just to clarify, column families are stored separately from one another. But within a column family, each rowkey => key / value is stored independently. I was under the impression that a rowkey would point to multiple key / value pairs within the column family stores. Am I understanding every

Re: Data taking up too much space when put into HBase

2010-11-12 Thread Debashis Saha
Just to add a note to J-D's comment: You want more than one column family (CF-A and CF-B) only when most (or one part) of your application reads information stored in CF-A and does not care about information in CF-B. In this case separating less-used information in different column famil