Okay. But can someone explain why the data size is exploding the way I mentioned earlier?
I have tried to insert sample data of around 12GB. The space occupied by the
HBase table is around 130GB. All my columns, including the ROWID, are strings.
I have even tried converting my ROWID to long, but that seems to occupy more
space, around 150GB.

Sample rows:

0-<>-f-<>-c-<>-Anarchism
0-<>-f-<>-e1-<>-Routledge Encyclopedia of Philosophy
0-<>-f-<>-e2-<>-anarchy
1-<>-f-<>-c-<>-Anarchism
1-<>-f-<>-e1-<>-anarchy
1-<>-f-<>-e2-<>-state (polity)
2-<>-f-<>-c-<>-Anarchism
2-<>-f-<>-e1-<>-anarchy
2-<>-f-<>-e2-<>-political philosophy
3-<>-f-<>-c-<>-Anarchism
3-<>-f-<>-e1-<>-The Globe and Mail
3-<>-f-<>-e2-<>-anarchy
4-<>-f-<>-c-<>-Anarchism
4-<>-f-<>-e1-<>-anarchy
4-<>-f-<>-e2-<>-stateless society

Is there a way I can find out the number of bytes occupied by each key:value
pair for each cell?

On Mon, Dec 5, 2011 at 8:43 PM, Ulrich Staudinger
<ustaudin...@activequant.org> wrote:

> The point I refer to is not so much about when HBase's server side
> flushes, but about when the client side flushes.
> If you put every value immediately, every put will result in an RPC call.
> If you collect the data on the client side and flush (on the client side)
> manually, it will result in one RPC call with hundreds or thousands of
> small puts inside, instead of hundreds or thousands of individual put RPC
> calls.
>
> Another issue is, I am not so sure what happens if you collect hundreds
> of thousands of small puts, which might well be bigger than the memstore,
> and flush then. I guess the HBase client will hang.
>
>
> On Mon, Dec 5, 2011 at 10:10 AM, kranthi reddy <kranthili2...@gmail.com>
> wrote:
>
> > Doesn't the configuration setting "hbase.hregion.memstore.flush.size"
> > handle the bulk insert? I was of the opinion that HBase would flush all
> > the puts to disk when its memstore is filled, whose property is defined
> > in hbase-default.xml. Is my understanding wrong here?
> >
> >
> > On Mon, Dec 5, 2011 at 1:26 PM, Ulrich Staudinger
> > <ustaudin...@activequant.org> wrote:
> >
> > > Hi there,
> > >
> > > While I cannot give you any concrete advice on your particular
> > > storage problem, I can share some experiences with you regarding
> > > performance.
> > >
> > > I also bulk import data regularly, around 4GB every day in about 150
> > > files with something between 10'000 and 30'000 lines in each.
> > >
> > > My first approach was to read every line and put it separately, which
> > > resulted in a load time of about an hour. My next approach was to
> > > read an entire file, add each individual put to a list and then store
> > > the entire list at once. This works fast in the beginning, but after
> > > about 20 files the server ran into compactions, couldn't cope with
> > > the load and finally the master crashed, leaving the regionserver and
> > > zookeeper running. To HBase's defense, I have to say that I did this
> > > on a standalone installation without Hadoop underneath, so the test
> > > may not be entirely fair.
> > > Next, I switched to a proper Hadoop layer with HBase on top. I now
> > > also put around 100 - 1000 lines (or puts) at once, in a bulk commit,
> > > and have insert times of around 0.5ms per row - which is very decent.
> > > My entire import now takes only 7 minutes.
> > >
> > > I think you must find a balance between the performance of your
> > > servers, how quick they are with compactions, and the amount of data
> > > you put at once. I have definitely found single puts to result in low
> > > performance.
> > >
> > > Best regards,
> > > Ulrich
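[Editor's note: a minimal sketch of the client-side buffering described above,
assuming the 0.90-era Java client API; the table name "wikilinks", the loop
bound and the buffer size are illustrative, while the family "f" and the
qualifiers follow the sample rows in this thread.]

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedLoad {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "wikilinks");   // hypothetical table name
        table.setAutoFlush(false);                      // buffer puts on the client
        table.setWriteBufferSize(8 * 1024 * 1024);      // illustrative 8 MB buffer
        try {
          for (long i = 0; i < 100000; i++) {
            Put put = new Put(Bytes.toBytes(Long.toString(i)));
            put.add(Bytes.toBytes("f"), Bytes.toBytes("c"), Bytes.toBytes("Anarchism"));
            put.add(Bytes.toBytes("f"), Bytes.toBytes("e1"), Bytes.toBytes("anarchy"));
            put.add(Bytes.toBytes("f"), Bytes.toBytes("e2"), Bytes.toBytes("state (polity)"));
            table.put(put);        // queued locally until the write buffer fills
          }
          table.flushCommits();    // send the remaining buffered puts in one batch
        } finally {
          table.close();
        }
      }
    }

Note that this client-side write buffer is separate from
hbase.hregion.memstore.flush.size, which only controls when a region server
flushes its memstore to disk; it does not batch anything on the client.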
> > > On Mon, Dec 5, 2011 at 6:23 AM, kranthi reddy <kranthili2...@gmail.com>
> > > wrote:
> > >
> > > > No, I split the table on the fly. I have done this because
> > > > converting my table into the HBase format (rowID, family,
> > > > qualifier, value) would result in the input file being around
> > > > 300GB. Hence, I decided to do the splitting and generate this
> > > > format on the fly.
> > > >
> > > > Will this affect the performance so heavily?
> > > >
> > > > On Mon, Dec 5, 2011 at 1:21 AM, <yuzhih...@gmail.com> wrote:
> > > >
> > > > > May I ask whether you pre-split your table before loading?
> > > > >
> > > > >
> > > > > On Dec 4, 2011, at 6:19 AM, kranthi reddy <kranthili2...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I am a newbie to HBase and Hadoop. I have set up a cluster of 4
> > > > > > machines and am trying to insert data. 3 of the machines are
> > > > > > tasktrackers, with 4 map tasks each.
> > > > > >
> > > > > > My data consists of about 1.3 billion rows with 4 columns each
> > > > > > (a 100GB txt file). The column structure is "rowID, word1,
> > > > > > word2, word3". My DFS replication in Hadoop and HBase is set to
> > > > > > 3 each. I have put only one column family and 3 qualifiers, one
> > > > > > for each field (word*).
> > > > > >
> > > > > > I am using the SampleUploader present in the HBase
> > > > > > distribution. To complete 40% of the insertion, it has taken
> > > > > > around 21 hrs and it's still running. I have 12 map tasks
> > > > > > running. *I would like to know whether the insertion time taken
> > > > > > here is on expected lines*, because when I used Lucene, I was
> > > > > > able to insert the entire data in about 8 hours.
> > > > > >
> > > > > > Also, there seems to be a huge explosion of data size here.
> > > > > > With a replication factor of 3 for HBase, I was expecting the
> > > > > > inserted table size to be around 350-400GB for the 100GB txt
> > > > > > file I have (300GB for replicating the data 3 times and 50+ GB
> > > > > > for additional storage information). But even at 40% completion
> > > > > > of the data insertion, the space occupied is around 550GB (it
> > > > > > looks like it might take around 1.2TB for a 100GB file). *I
> > > > > > have used a String rowID instead of Long. Will that account for
> > > > > > such a rapid increase in data storage?*
> > > > > >
> > > > > > Regards,
> > > > > > Kranthi
> > > >
> > > >
> > > > --
> > > > Kranthi Reddy. B
> > > >
> > > > http://www.setusoftware.com/setu/index.htm
> >
> > --
> > Kranthi Reddy. B
> >
> > http://www.setusoftware.com/setu/index.htm
>

--
Kranthi Reddy. B

http://www.setusoftware.com/setu/index.htm
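[Editor's note: on the per-cell size question at the top of the thread,
every cell in an HFile is stored as a full KeyValue, i.e. the row key,
column family, qualifier, timestamp and type are written next to the value,
so a long string row key is repeated once per qualifier, and only then does
the HDFS replication factor of 3 multiply the result. A minimal sketch,
assuming the 0.90-era Java client and the same hypothetical table name, that
scans a handful of rows and prints the key and value lengths of each cell.]

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CellSizes {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "wikilinks");   // hypothetical table name
        Scan scan = new Scan();
        scan.setCaching(100);
        ResultScanner scanner = table.getScanner(scan);
        try {
          int rows = 0;
          for (Result result : scanner) {
            for (KeyValue kv : result.raw()) {
              // getKeyLength() covers row + family + qualifier + timestamp + type;
              // getLength() is the whole serialized KeyValue including the value.
              System.out.println(Bytes.toString(kv.getRow()) + "/"
                  + Bytes.toString(kv.getQualifier())
                  + " key=" + kv.getKeyLength()
                  + " value=" + kv.getValueLength()
                  + " total=" + kv.getLength() + " bytes");
            }
            if (++rows == 5) break;   // a few rows are enough to see the overhead
          }
        } finally {
          scanner.close();
          table.close();
        }
      }
    }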
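[Editor's note: on the pre-split question in the quoted thread, creating the
table with split points up front spreads writes over all region servers from
the start instead of funnelling every early put through a single region. A
hedged sketch, again assuming the 0.90-era admin API and the same
hypothetical table; the split points are only illustrative and should follow
the real row-key distribution.]

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplit {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("wikilinks");  // hypothetical name
        desc.addFamily(new HColumnDescriptor("f"));
        // Nine split points ("1" through "9") give ten regions; pick real
        // split points to match the actual distribution of your row keys.
        byte[][] splits = new byte[9][];
        for (int i = 1; i <= 9; i++) {
          splits[i - 1] = Bytes.toBytes(Integer.toString(i));
        }
        admin.createTable(desc, splits);
      }
    }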