Hi Xueling,

Here's a general outline:

My guess is that your "position of match" field is bounded (perhaps by the
number of base pairs in the human genome?). Given this, you can probably
write a very simple Partitioner implementation that divides this field into
ranges, each covering an approximately equal number of records.
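
Here's a rough, untested sketch of what that Partitioner could look like.
The MAX_POSITION bound is a guess, and equal-width ranges only give
roughly equal-sized partitions if positions are spread fairly uniformly -
if your data is skewed you'd want to sample it and pick range boundaries
accordingly instead:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class PositionRangePartitioner extends Partitioner<LongWritable, Text> {

      // Assumed upper bound on "position of match" -- roughly the number of
      // base pairs in the human genome. Adjust to whatever your real bound is.
      private static final long MAX_POSITION = 3200000000L;

      @Override
      public int getPartition(LongWritable position, Text record, int numPartitions) {
        // Split [0, MAX_POSITION) into numPartitions equal-width ranges so each
        // reducer sees one contiguous slice of positions. Equal width only
        // approximates equal record counts if positions are roughly uniform.
        long rangeWidth = MAX_POSITION / numPartitions + 1;
        long bucket = position.get() / rangeWidth;
        return (int) Math.min(bucket, numPartitions - 1);
      }
    }

Hook it up with job.setPartitionerClass(PositionRangePartitioner.class) and
set the number of reduce tasks to however many ranges you want.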

Next, write a simple MR job that takes in a line of data and outputs the
same line with the position-of-match as the key. The output will get
partitioned by the function above, so each reducer ends up receiving all
of the records in a given range.
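
A mapper for that can be as simple as this (again untested - I'm guessing
from your sample record that position-of-match is the 8th whitespace-separated
field, so adjust POSITION_FIELD if your layout differs):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PositionExtractMapper
        extends Mapper<LongWritable, Text, LongWritable, Text> {

      // Guess: "position of match" is field index 7 (0-based) in your records.
      private static final int POSITION_FIELD = 7;

      private final LongWritable position = new LongWritable();

      @Override
      protected void map(LongWritable fileOffset, Text line, Context context)
          throws IOException, InterruptedException {
        String[] fields = line.toString().split("\\s+");
        position.set(Long.parseLong(fields[POSITION_FIELD]));
        // Emit the whole original line, keyed by position-of-match; the custom
        // partitioner routes it to the right range, and the framework sorts by
        // key within each reducer.
        context.write(position, line);
      }
    }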

In the reducer, write every position to the non-sparse output file, and
every 1000th position (along with its byte offset in the non-sparse file)
to your "sparse" output file.
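
Here's one way that reducer could look. In this sketch each reduce task
writes its own data file straight to HDFS so it can record byte offsets via
getPos(), and emits the sparse entries as the job's normal output. The path
and the 1000-record stride are placeholders, and I haven't run any of this:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SparseIndexReducer
        extends Reducer<LongWritable, Text, LongWritable, Text> {

      private FSDataOutputStream dataOut; // the "non-sparse" file for this range
      private long recordCount = 0;

      @Override
      protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        // Placeholder path: one data file per reduce task (i.e. per range).
        Path dataPath = new Path("/seqdata/nonsparse-"
            + context.getTaskAttemptID().getTaskID().getId());
        dataOut = fs.create(dataPath);
      }

      @Override
      protected void reduce(LongWritable position, Iterable<Text> records,
                            Context context) throws IOException, InterruptedException {
        // Keys arrive in sorted order within a reducer, so the data file written
        // here ends up sorted by position - which is what makes the binary
        // search on the query side work.
        for (Text record : records) {
          if (recordCount % 1000 == 0) {
            // Sparse index entry: the position and the byte offset at which this
            // record starts in the data file. These go to the job's normal
            // output (part-r-NNNNN), which becomes the sparse index.
            context.write(position, new Text(Long.toString(dataOut.getPos())));
          }
          dataOut.write(record.getBytes(), 0, record.getLength());
          dataOut.write('\n');
          recordCount++;
        }
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        dataOut.close();
      }
    }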

In your realtime query server (not part of Hadoop), load the "sparse" file
into RAM and do a binary search to find the "bins" in which the range
endpoints land, then open the non-sparse output on HDFS to finish the
count.
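
On the query-server side, something roughly like this. It assumes, for
simplicity, a single position-sorted data file plus its sparse index already
loaded into two parallel arrays (with one file per reducer you'd first pick
the files whose ranges overlap the query, then do the same thing per file):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RangeCountServer {

      private final long[] indexPositions; // sparse index: sorted positions...
      private final long[] indexOffsets;   // ...and their byte offsets in the data file
      private final FileSystem fs;
      private final Path dataFile;         // the non-sparse, position-sorted file on HDFS

      public RangeCountServer(long[] indexPositions, long[] indexOffsets,
                              Path dataFile, Configuration conf) throws IOException {
        this.indexPositions = indexPositions;
        this.indexOffsets = indexOffsets;
        this.dataFile = dataFile;
        this.fs = FileSystem.get(conf);
      }

      /** Counts records whose position-of-match falls in [lo, hi]. */
      public long count(long lo, long hi) throws IOException {
        // Binary search the in-memory index for the last entry <= lo; that
        // gives a byte offset at which to start scanning the data file.
        int i = Arrays.binarySearch(indexPositions, lo);
        if (i < 0) {
          i = Math.max(0, -i - 2); // not found: take (insertion point - 1)
        }
        long count = 0;
        FSDataInputStream in = fs.open(dataFile);
        try {
          in.seek(indexOffsets[i]);
          BufferedReader reader = new BufferedReader(new InputStreamReader(in));
          String line;
          while ((line = reader.readLine()) != null) {
            // Field 7 (0-based) is assumed to be "position of match".
            long pos = Long.parseLong(line.split("\\s+")[7]);
            if (pos > hi) {
              break; // file is sorted by position, so nothing later can match
            }
            if (pos >= lo) {
              count++;
            }
          }
        } finally {
          in.close();
        }
        return count;
      }
    }

Your example range ("position of match" > 200 and "position of match" + 36
< 200,000) would then just be count(201, 199963).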

Hope that helps.

Thanks
-Todd

On Tue, Jan 5, 2010 at 5:26 PM, Xueling Shu <x...@systemsbiology.org> wrote:

> To rephrase the sentence "Or what APIs I should start with for my testing?": I
> mean "Which HDFS APIs should I start looking into for my testing?"
>
> Thanks,
> Xueling
>
> On Tue, Jan 5, 2010 at 5:24 PM, Xueling Shu <x...@systemsbiology.org>
> wrote:
>
> > Hi Todd:
> >
> > After finishing some other tasks, I finally got back to HDFS testing.
> >
> > One question about your last reply to this thread: are there any code
> > examples close to your second and third recommendations? Or which APIs
> > should I start with for my testing?
> >
> > Thanks.
> > Xueling
> >
> >
> > On Sat, Dec 12, 2009 at 1:01 PM, Todd Lipcon <t...@cloudera.com> wrote:
> >
> >> Hi Xueling,
> >>
> >> In that case, I would recommend the following:
> >>
> >> 1) Put all of your data on HDFS
> >> 2) Write a MapReduce job that sorts the data by position of match
> >> 3) As a second output of this job, you can write a "sparse index" -
> >> basically a set of entries like this:
> >>
> >> <position of match> <offset into file> <number of entries following>
> >>
> >> where you're basically giving offsets into every 10K records or so. If
> >> you index every 10K records, then 5 billion total will mean 500,000
> >> index entries. Each index entry shouldn't be more than 20 bytes, so
> >> 500,000 entries will be about 10MB. This is super easy to fit into memory.
> >> (You could probably index every 100th record instead and end up with
> >> around 1GB, which still fits in memory on a reasonable server.)
> >>
> >> Then to satisfy your count-range query, you can simply scan your
> >> in-memory sparse index. Some of the indexed blocks will be completely
> >> included in the range, in which case you just add up the "number of
> >> entries following" column. The start and finish block will be
> >> partially covered, so you can use the file offset info to load that
> >> file off HDFS, start reading at that offset, and finish the count.
> >>
> >> Total time per query should be <100ms no problem.
> >>
> >> -Todd
> >>
> >> On Sat, Dec 12, 2009 at 10:38 AM, Xueling Shu <x...@systemsbiology.org>
> >> wrote:
> >> > Hi Todd:
> >> >
> >> > Thank you for your reply.
> >> >
> >> > The datasets won't be updated often. But the queries against a data set
> >> > are frequent. The quicker the query, the better. For example, we have
> >> > done testing on a MySQL database (5 billion records randomly scattered
> >> > into 24 tables) and the slowest query against the biggest table
> >> > (400,000,000 records) is around 12 minutes. So if any Hadoop product can
> >> > speed up the search, then that product is what we are looking for.
> >> >
> >> > Cheers,
> >> > Xueling
> >> >
> >> > On Fri, Dec 11, 2009 at 7:34 PM, Todd Lipcon <t...@cloudera.com> wrote:
> >> >
> >> >> Hi Xueling,
> >> >>
> >> >> One important question that can really change the answer:
> >> >>
> >> >> How often does the dataset change? Can the changes be merged in in
> >> >> bulk every once in a while, or do you need to actually update them
> >> >> randomly very often?
> >> >>
> >> >> Also, how fast is "quick"? Do you mean 1 minute, 10 seconds, 1 second,
> >> >> or 10ms?
> >> >>
> >> >> Thanks
> >> >> -Todd
> >> >>
> >> >> On Fri, Dec 11, 2009 at 7:19 PM, Xueling Shu <x...@systemsbiology.org>
> >> >> wrote:
> >> >> >  Hi there:
> >> >> >
> >> >> > I am researching Hadoop to see which of its products suits our need
> >> >> > for quick queries against large data sets (billions of records per
> >> >> > set).
> >> >> >
> >> >> > The queries will be performed against chip sequencing data. Each
> >> >> > record is one line in a file. To be clear, below is a sample record
> >> >> > in the data set.
> >> >> >
> >> >> >
> >> >> > One line (record) looks like:
> >> >> > 1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 *103570835* F .. 23G 24C
> >> >> >
> >> >> > The highlighted field is called "position of match" and the query we
> >> >> > are interested in is the # of sequences in a certain range of this
> >> >> > "position of match". For instance the range can be "position of
> >> >> > match" > 200 and "position of match" + 36 < 200,000.
> >> >> >
> >> >> > Any suggestions on the Hadoop product I should start with to
> >> >> > accomplish the task? HBase, Pig, Hive, or ...?
> >> >> >
> >> >> > Thanks!
> >> >> >
> >> >> > Xueling
> >> >> >
> >> >>
> >> >
> >>
> >
> >
>
