Some context
************

1. We generate 3 million files a day  --  3 million * 365 = roughly 1.1 billion / year
(the final data set will cover, say, 10 years though)
2. Each file has data relating to a user and day 
3. Every user will have some activity throughout the year (not
necessarily on every day)
4. Our search is by a {user , date range} combination  --  give me data
for a given user between these dates
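
To make the lookup in point 4 concrete, here is a minimal sketch of how a composite {user , date} row key could be laid out so that one user's date range is a single contiguous key range. The fixed width of 10 digits, the "-" separator and the helper names row_key and scan_bounds are assumptions for illustration, not anything HBase prescribes.

```python
from datetime import date

USER_WIDTH = 10  # assumed fixed width so that keys sort lexicographically


def row_key(user_id, day):
    """Build a composite key: zero-padded user ID, then the day as YYYYMMDD."""
    return f"{user_id:0{USER_WIDTH}d}-{day:%Y%m%d}"


def scan_bounds(user_id, start, end):
    """Return (start_key, stop_key) covering one user from start to end.

    The stop key is exclusive, so a character that sorts after every digit
    is appended to keep the row for `end` itself inside the range.
    """
    return row_key(user_id, start), row_key(user_id, end) + "~"


if __name__ == "__main__":
    lo, hi = scan_bounds(42, date(2010, 1, 1), date(2010, 1, 31))
    print(lo, hi)
```

Because the user ID comes first, all rows for one user sit next to each other and the date suffix orders them by day within that user.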

I am assuming the following could be done

1.  The input splits will be based on, say, some range of days  --  say
for a year we divide into 12 splits (all files for a month go to the
first mapper, and so on)

2.  Each mapper processes the files and creates {Key , value}
combinations --  key is composite  {user , date}

3.  Custom Partitioner (say it has some scheme where it sends a range of
users, and all their associated date info, to a particular reducer)

4.  You will have the output generated per reducer. We just need to run
loadtable.rb on this output
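
A minimal sketch of the partitioning scheme in step 3, written as a plain function rather than a real Hadoop Partitioner. The name reducer_for_user, the 1-based user IDs and the fixed range size are assumptions for illustration; reducer indexes here are 0-based.

```python
def reducer_for_user(user_id, users_per_reducer, num_reducers):
    """Map a 1-based user ID to a 0-based reducer index by contiguous ranges."""
    if user_id < 1:
        raise ValueError("user IDs are assumed to start at 1")
    index = (user_id - 1) // users_per_reducer
    # Any IDs beyond the planned ranges fall into the last reducer.
    return min(index, num_reducers - 1)


if __name__ == "__main__":
    for uid in (1, 1_000_000, 1_000_001):
        print(uid, "->", reducer_for_user(uid, 1_000_000, 20))
```

Since each reducer receives one contiguous range of users, and each reducer sees its keys in sorted order, the per-reducer outputs line up in key order across reducers.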

This is what I am thinking  --  instead of loading all 10 years of data
into 1 table, load it into one table per year.  That way I only have to
deal with an MR failure at year granularity, because the whole load
may take weeks

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of
stack
Sent: Thursday, January 14, 2010 11:33 AM
To: [email protected]
Subject: Re: HBase bulk load

On Wed, Jan 13, 2010 at 9:49 PM, Sriram Muthuswamy Chittathoor <
[email protected]> wrote:

> I am trying to use this technique to bulk load, say, 20 billion rows.  I
> tried it on a smaller set of 20 million rows. One thing I had to take
> care of was writing custom partitioning logic so that a range of keys
> goes only to a particular reducer, since there was some mention of global
> ordering.
> For example  Users  (1 --  1mill) ---> Reducer 1 and so on
>
Good.



> My questions are:
> 1.  Can I divide the bulk loading into multiple runs  --  the existing
> bulk load bails out if it finds an HDFS output directory with the same
> name
>

No.  It's not currently written to do that, but especially if your keys
are ordered, it probably wouldn't take much to make the above work (the
first job does the first set of keys, and so on).
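
As a rough sketch of "first job does the first set of keys, and so on", the key space could be cut into consecutive, non-overlapping runs, each writing to its own output directory so that no run trips over an existing directory. The function plan_runs, the run size and the directory naming are hypothetical; how the per-run outputs would then be handed to the loader is not covered here.

```python
def plan_runs(first_user, last_user, users_per_run, base_dir):
    """Split [first_user, last_user] into consecutive runs with distinct output dirs."""
    runs = []
    start = first_user
    run_no = 0
    while start <= last_user:
        end = min(start + users_per_run - 1, last_user)
        # Each run gets its own directory name (hypothetical layout).
        runs.append((start, end, f"{base_dir}/run-{run_no:03d}"))
        start = end + 1
        run_no += 1
    return runs


if __name__ == "__main__":
    for first, last, out_dir in plan_runs(1, 25, 10, "/bulk/out"):
        print(first, last, out_dir)
```

If a run fails, only that key range needs to be redone, since the other runs' directories are untouched.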


> 2.  What I want to do is make multiple runs of 10 billion and then
> combine the output before running  loadtable.rb --  is this possible ?
> I am thinking this may be required in case my MR bulk loading fails in
> between and I need to start from where I crashed
>
Well, MR does retries but, yeah, you could run into some issue at the
10B mark and want to then start over from there rather than start from
the beginning.

One thing that the current setup does not do is remove the task hfile on
failure.  We should add this.  It would fix the case where speculative
execution is enabled and the speculative tasks are killed, so that we
don't leave half-made hfiles around (currently I believe they show up as
zero-length files).

St.Ack



> Any tips with huge bulk loading experience ?
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of
> stack
> Sent: Thursday, January 14, 2010 6:19 AM
> To: [email protected]
> Subject: Re: HBase bulk load
>
> See
> http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk
> St.Ack
>
> On Wed, Jan 13, 2010 at 4:30 PM, Ted Yu <[email protected]> wrote:
>
> > Jonathan:
> > Since you implemented
> > https://issues.apache.org/jira/si/jira.issueviews:issue-html/HBASE-48/HBASE-48.html ,
> > maybe you can point me to some document on how bulk load is used?
> > I found bin/loadtable.rb and assume that can be used to import data back
> > into HBase.
> >
> > Thanks
> >
>
