See my replies

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of stack
Sent: Friday, January 15, 2010 4:40 AM
To: [email protected]
Subject: Re: HBase bulk load
On Wed, Jan 13, 2010 at 10:28 PM, Sriram Muthuswamy Chittathoor <
[email protected]> wrote:

> Some context
> ************
>
> 1. We generate 3 million files a day -- 3 million * 365 = ~1 billion /
> year (the final target is 10 years, though)
> 2. Each file has data relating to a user and a day
> 3. Every user will have some activity throughout the year (not
> necessarily on every day)
> 4. Our search is by {user, date range} combination -- give me the data
> for a given user between these date ranges

So, your key will be userid + day since epoch?

-- Correct (a sketch of such a key appears after this message)

> Here is what I am assuming could be done:
>
> 1. The InputSplits will divide the #splits based on some range of days
> -- say for a year we divide into 12 splits (all files for a month go
> to the first mapper, and so on)
> 2. Each mapper processes the files and creates {key, value}
> combinations -- the key is the composite {user, date}

Mappers should run for about 5-10 minutes each. How many months of data
do you think this will be per mapper?

-- It looks like it may vary, based on the granularity (processing 1
year's worth of data vs. 1 month's) and the number of boxes I have. We
need to bulk load 8 years' worth of data from our archives. That will be
8 * 12 months of data.

What's your original key made of?

-- Each data file is 4K of text which has 6 players' data on average. We
will parse it and extract the per-userid/day data (so each of these
would be < 0.5K).

Would you do this step in multiple stages or feed this mapreduce job all
10 years of data?

-- Either way works for me. Since I have 8 years' worth of archived data
I need to get it onto the system as a one-time effort. Will it be fine
if I proceed in year order -- 2000, 2001, 2002, and so on? The only
requirement is that at the end these individual years' data (as hfiles)
need to be loaded into HBase.

> 3. A custom partitioner (say with some scheme where it sends a range
> of users (and all their associated date info) to a particular reducer)

Maybe write this back to hdfs as sequencefiles rather than as hfiles,
then take the output of this job's reducers and feed it to your
hfileoutputformat job one piece at a time if you want to piecemeal the
creation of hfiles (many jobs rather than one big one). In this case
you'd have one big table rather than one per year as you were
suggesting. You might have to move things around in hdfs after all jobs
were done to put directories where loadtable.rb expects to find stuff
(or better, just mod loadtable.rb... it's a simple script).

-- Can you give me some link for doing this? If I am getting you right,
is this the sequence:

1. Start with say year 2000 (about 1 billion 4K files to be processed
and loaded)
2. Divide it into splits initially based just on filename ranges (the
user/day data is hidden inside each file)
3. Each mapper gets a bunch of files (with 20 mappers, each one would
have to process 50 million 4K files -- seems too much even for a single
year?? Should I process a single month at a time instead??)
4. Each mapper parses its files and extracts the user/day records
5. The custom partitioner sends a range of users/days to a particular
reducer
6. The reducers generate sequence files in parallel -- there will be
multiple of them

My question here: within each year there will be sequence files each
containing a range of users' data. Do I need to identify related ones
and put them together in one hfile, since the user/day records for all
10 years should end up together in the final hfiles? So some manual
stuff is required here -- taking related sequence files (those
containing the same range of user/day data) and feeding them together
to the hfileoutputformat job?
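[For illustration, a minimal sketch in Java of the userid + day-since-epoch
composite key discussed above. The class and method names are hypothetical,
not from the thread, and it assumes non-negative ids and day ordinals:]

    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical helper: builds the composite row key {userid, day}
    // as fixed-width bytes so rows sort first by user, then by day --
    // which is what the {user, date range} scans need.
    public class UserDayKey {
      // Assumes non-negative values; negative numbers would not
      // byte-sort in numeric order with this naive encoding.
      public static byte[] makeRowKey(long userId, int daysSinceEpoch) {
        // 8 bytes of user id followed by 4 bytes of day ordinal.
        return Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(daysSinceEpoch));
      }
    }

[A scan over one user's date range then becomes a scan from
makeRowKey(user, startDay) to makeRowKey(user, endDay + 1).]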
> 4. You will have the output generated per reducer. Just need to run
> loadtable on this output.
>
> This is what I am thinking -- instead of loading all 10 years' data
> into 1 table, load it into a table per year. That way I only have to
> deal with failure of the MR job at a year granularity, because the
> whole load may take weeks.

You could do this, or go the route suggested above. It shouldn't take
weeks. Ryan is claiming that he put 12B (small) rows in two days with
his fancy new multiput. Writing the hfiles should run at least an order
of magnitude faster -- unless your cells are large.

This goes without saying, but I'll say it anyway: please test first with
small datasets to ensure stuff works for you. Use the head of the 0.20
branch. It has a small fix for a silly bug in KeyValueSortReducer.

-- Could you also give some links to this multiput technique??

St.Ack

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of
> stack
> Sent: Thursday, January 14, 2010 11:33 AM
> To: [email protected]
> Subject: Re: HBase bulk load
>
> On Wed, Jan 13, 2010 at 9:49 PM, Sriram Muthuswamy Chittathoor <
> [email protected]> wrote:
>
> > I am trying to use this technique to bulk load 20 billion rows. I
> > tried it on a smaller set of 20 million rows. One thing I had to
> > take care of was writing custom partitioning logic so that a given
> > range of keys only goes to a particular reducer, since there was
> > some mention of global ordering (see the sketch after this quoted
> > message). For example, users (1 -- 1 million) ---> reducer 1, and
> > so on.
>
> Good.
>
> > My questions are:
> > 1. Can I divide the bulk loading into multiple runs? The existing
> > bulk load bails out if it finds an HDFS output directory with the
> > same name.
>
> No. It's not currently written to do that but, especially if your keys
> are ordered, it probably wouldn't take much to make the above work
> (the first job does the first set of keys, and so on).
>
> > 2. What I want to do is make multiple runs of 10 billion each and
> > then combine the output before running loadtable.rb -- is this
> > possible? I am thinking this may be required in case my MR bulk
> > loading fails in between and I need to restart from where I crashed.
>
> Well, MR does retries but, yeah, you could run into some issue at the
> 10B mark and want to start over from there rather than from the
> beginning.
>
> One thing that the current setup does not do is remove the task hfile
> on failure. We should add this. It would fix the case where, with
> speculative execution enabled and the speculative tasks killed, we
> leave around half-made hfiles (currently I believe they show up as
> zero-length files).
>
> St.Ack
>
> > Any tips from anyone with huge bulk loading experience?
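[A minimal sketch of such a range partitioner, assuming the fixed-width
userid-prefixed row key from the earlier sketch and a known upper bound on
user ids -- both assumptions, not details from the thread:]

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical partitioner: routes contiguous user-id ranges to
    // reducers so each reducer sees a non-overlapping, globally ordered
    // slice of the key space, as hfile writing requires.
    public class UserRangePartitioner
        extends Partitioner<ImmutableBytesWritable, KeyValue> {

      // Assumed upper bound on user ids; in practice you would sample
      // the key space rather than hard-code this.
      private static final long MAX_USER_ID = 20000000L;

      @Override
      public int getPartition(ImmutableBytesWritable rowKey, KeyValue value,
                              int numReducers) {
        // The first 8 bytes of the row key are the user id (see key sketch).
        long userId = Bytes.toLong(rowKey.get(), rowKey.getOffset());
        long usersPerReducer = (MAX_USER_ID / numReducers) + 1;
        return (int) (userId / usersPerReducer);  // users 0..N -> reducer 0, etc.
      }
    }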
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]] On Behalf Of
> > stack
> > Sent: Thursday, January 14, 2010 6:19 AM
> > To: [email protected]
> > Subject: Re: HBase bulk load
> >
> > See
> > http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk
> >
> > St.Ack
> >
> > On Wed, Jan 13, 2010 at 4:30 PM, Ted Yu <[email protected]> wrote:
> >
> > > Jonathan:
> > > Since you implemented
> > > https://issues.apache.org/jira/si/jira.issueviews:issue-html/HBASE-48/HBASE-48.html,
> > > maybe you can point me to some document on how bulk load is used?
> > > I found bin/loadtable.rb and assume that can be used to import data
> > > back into HBase.
> > >
> > > Thanks
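[For reference, the bulk-load flow in the docs linked above ends by pointing
loadtable.rb at the directory the HFileOutputFormat job wrote. The invocation
looks something like the following; the table name and output path here are
hypothetical, so check the linked package docs for the exact form:]

    # The MR job (KeyValueSortReducer + HFileOutputFormat) writes hfiles
    # under the output directory; loadtable.rb then installs them as a
    # new table.
    $ ${HBASE_HOME}/bin/hbase org.jruby.Main bin/loadtable.rb usertable /user/sriram/hfile-output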
