Okay. But can someone explain why the data size is exploding the way I mentioned earlier?
I have tried to insert sample data of around 12GB. The space occupied by the
HBase table is around 130GB. All my columns, including the ROWID, are strings.
I have even tried converting my ROWID to long, but that seems to occupy more
space, around 150GB.

Sample rows:

0-<>-f-<>-c-<>-Anarchism
0-<>-f-<>-e1-<>-Routledge Encyclopedia of Philosophy
0-<>-f-<>-e2-<>-anarchy
1-<>-f-<>-c-<>-Anarchism
1-<>-f-<>-e1-<>-anarchy
1-<>-f-<>-e2-<>-state (polity)
2-<>-f-<>-c-<>-Anarchism
2-<>-f-<>-e1-<>-anarchy
2-<>-f-<>-e2-<>-political philosophy
3-<>-f-<>-c-<>-Anarchism
3-<>-f-<>-e1-<>-The Globe and Mail
3-<>-f-<>-e2-<>-anarchy
4-<>-f-<>-c-<>-Anarchism
4-<>-f-<>-e1-<>-anarchy
4-<>-f-<>-e2-<>-stateless society

Is there a way I can find out the number of bytes occupied by each key:value
pair for each cell?

On Mon, Dec 5, 2011 at 8:43 PM, Ulrich Staudinger
<ustaudin...@activequant.org> wrote:

> The point I refer to is not so much about when HBase's server side
> flushes, but about when the client side flushes.
> If you put every value immediately, every put will result in an RPC call.
> If you collect the data on the client side and flush (on the client side)
> manually, it will result in one RPC call with hundreds or thousands of
> small puts inside, instead of hundreds or thousands of individual put RPC
> calls.
>
> Another issue is, I am not so sure what happens if you collect hundreds
> of thousands of small puts, which might well be bigger than the memstore,
> and flush then. I guess the HBase client will hang.
>
>
> On Mon, Dec 5, 2011 at 10:10 AM, kranthi reddy <kranthili2...@gmail.com>
> wrote:
>
> > Doesn't the configuration setting "hbase.hregion.memstore.flush.size"
> > handle the bulk insert? I was of the opinion that HBase would flush all
> > the puts to disk when its memstore is filled, whose property is defined
> > in hbase-default.xml. Is my understanding wrong here?
> >
> >
> > On Mon, Dec 5, 2011 at 1:26 PM, Ulrich Staudinger
> > <ustaudin...@activequant.org> wrote:
> >
> > > Hi there,
> > >
> > > While I cannot give you any concrete advice on your particular
> > > storage problem, I can share some experiences with you regarding
> > > performance.
> > >
> > > I also bulk import data regularly, around 4GB every day in about 150
> > > files with something between 10'000 and 30'000 lines in each.
> > >
> > > My first approach was to read every line and put it separately, which
> > > resulted in a load time of about an hour. My next approach was to
> > > read an entire file, add each individual put to a list and then store
> > > the entire list at once. This works fast in the beginning, but after
> > > about 20 files the server ran into compactions, couldn't cope with
> > > the load and finally the master crashed, leaving the regionserver and
> > > zookeeper running. To HBase's defense, I have to say that I did this
> > > on a standalone installation without Hadoop underneath, so the test
> > > may not be entirely fair.
> > > Next, I switched to a proper Hadoop layer with HBase on top. I now
> > > also put around 100 - 1000 lines (or puts) at once, in a bulk commit,
> > > and have insert times of around 0.5ms per row - which is very decent.
> > > My entire import now takes only 7 minutes.
> > >
> > > I think you must find a balance between the performance of your
> > > servers, how quick they are with compactions, and the amount of data
> > > you put at once. I have definitely found single puts to result in low
> > > performance.
> > >
> > > Best regards,
> > > Ulrich
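[Editor's note: a minimal sketch of the client-side buffering described above,
assuming the 0.90-era Java client API; the table name "wikilinks", the loop
bound and the buffer size are illustrative, while the family "f" and the
qualifiers follow the sample rows in this thread.]

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedLoad {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "wikilinks");   // hypothetical table name
        table.setAutoFlush(false);                      // buffer puts on the client
        table.setWriteBufferSize(8 * 1024 * 1024);      // illustrative 8 MB buffer
        try {
          for (long i = 0; i < 100000; i++) {
            Put put = new Put(Bytes.toBytes(Long.toString(i)));
            put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), Bytes.toBytes("Anarchism"));
            put.add(Bytes.toBytes("f"), Bytes.toBytes("e1"), Bytes.toBytes("anarchy"));
            put.add(Bytes.toBytes("f"), Bytes.toBytes("e2"), Bytes.toBytes("state (polity)"));
            table.put(put);        // queued locally until the write buffer fills
          }
          table.flushCommits();    // send the remaining buffered puts in one batch
        } finally {
          table.close();
        }
      }
    }

Note that this client-side write buffer is separate from
hbase.hregion.memstore.flush.size, which only controls when a region server
flushes its memstore to disk; it does not batch anything on the client.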
> > > On Mon, Dec 5, 2011 at 6:23 AM, kranthi reddy <kranthili2...@gmail.com>
> > > wrote:
> > >
> > > > No, I split the table on the fly. I have done this because
> > > > converting my table into the HBase format (rowID, family,
> > > > qualifier, value) would result in the input file being around
> > > > 300GB. Hence, I decided to do the splitting and generate this
> > > > format on the fly.
> > > >
> > > > Will this affect the performance so heavily?
> > > >
> > > > On Mon, Dec 5, 2011 at 1:21 AM, <yuzhih...@gmail.com> wrote:
> > > >
> > > > > May I ask whether you pre-split your table before loading?
> > > > >
> > > > >
> > > > > On Dec 4, 2011, at 6:19 AM, kranthi reddy <kranthili2...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I am a newbie to HBase and Hadoop. I have set up a cluster of 4
> > > > > > machines and am trying to insert data. 3 of the machines are
> > > > > > tasktrackers, with 4 map tasks each.
> > > > > >
> > > > > > My data consists of about 1.3 billion rows with 4 columns each
> > > > > > (a 100GB txt file). The column structure is "rowID, word1,
> > > > > > word2, word3". My DFS replication in Hadoop and HBase is set to
> > > > > > 3 each. I have put only one column family and 3 qualifiers, one
> > > > > > for each field (word*).
> > > > > >
> > > > > > I am using the SampleUploader present in the HBase
> > > > > > distribution. To complete 40% of the insertion, it has taken
> > > > > > around 21 hrs and it's still running. I have 12 map tasks
> > > > > > running. *I would like to know whether the insertion time taken
> > > > > > here is on expected lines*, because when I used Lucene, I was
> > > > > > able to insert the entire data in about 8 hours.
> > > > > >
> > > > > > Also, there seems to be a huge explosion of data size here.
> > > > > > With a replication factor of 3 for HBase, I was expecting the
> > > > > > inserted table size to be around 350-400GB for the 100GB txt
> > > > > > file I have (300GB for replicating the data 3 times and 50+ GB
> > > > > > for additional storage information). But even at 40% completion
> > > > > > of the data insertion, the space occupied is around 550GB (it
> > > > > > looks like it might take around 1.2TB for a 100GB file). *I
> > > > > > have used a String rowID instead of Long. Will that account for
> > > > > > such a rapid increase in data storage?*
> > > > > >
> > > > > > Regards,
> > > > > > Kranthi
> > > >
> > > >
> > > > --
> > > > Kranthi Reddy. B
> > > >
> > > > http://www.setusoftware.com/setu/index.htm
> >
> > --
> > Kranthi Reddy. B
> >
> > http://www.setusoftware.com/setu/index.htm
>

--
Kranthi Reddy. B

http://www.setusoftware.com/setu/index.htm
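[Editor's note: on the per-cell size question at the top of the thread,
every cell in an HFile is stored as a full KeyValue, i.e. the row key,
column family, qualifier, timestamp and type are written next to the value,
so a long string row key is repeated once per qualifier, and only then does
the HDFS replication factor of 3 multiply the result. A minimal sketch,
assuming the 0.90-era Java client and the same hypothetical table name, that
scans a handful of rows and prints the key and value lengths of each cell.]

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CellSizes {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "wikilinks");   // hypothetical table name
        Scan scan = new Scan();
        scan.setCaching(100);
        ResultScanner scanner = table.getScanner(scan);
        try {
          int rows = 0;
          for (Result result : scanner) {
            for (KeyValue kv : result.raw()) {
              // getKeyLength() covers row + family + qualifier + timestamp + type;
              // getLength() is the whole serialized KeyValue including the value.
              System.out.println(Bytes.toString(kv.getRow()) + "/"
                  + Bytes.toString(kv.getQualifier())
                  + " key=" + kv.getKeyLength()
                  + " value=" + kv.getValueLength()
                  + " total=" + kv.getLength() + " bytes");
            }
            if (++rows == 5) break;   // a few rows are enough to see the overhead
          }
        } finally {
          scanner.close();
          table.close();
        }
      }
    }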
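[Editor's note: on the pre-split question in the quoted thread, creating the
table with split points up front spreads writes over all region servers from
the start instead of funnelling every early put through a single region. A
hedged sketch, again assuming the 0.90-era admin API and the same
hypothetical table; the split points are only illustrative and should follow
the real row-key distribution.]

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplit {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("wikilinks");  // hypothetical name
        desc.addFamily(new HColumnDescriptor("f"));
        // Nine split points ("1" through "9") give ten regions; pick real
        // split points to match the actual distribution of your row keys.
        byte[][] splits = new byte[9][];
        for (int i = 1; i <= 9; i++) {
          splits[i - 1] = Bytes.toBytes(Integer.toString(i));
        }
        admin.createTable(desc, splits);
      }
    }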