Hi Xueling,

Here's a general outline:
My guess is that your "position of match" field is bounded (perhaps by the number of base pairs in the human genome?). Given this, you can probably write a very simple Partitioner implementation that divides this field into ranges, each with an approximately equal number of records.

Next, write a simple MR job which takes in a line of data and outputs the same line, but with the position-of-match as the key. This will get partitioned by the above function, so each reducer ends up receiving all of the records in a given range. In the reducer, simply write every 1000th position into your "sparse" output file (along with its offset into the non-sparse output file), and every position into the non-sparse output file.

In your realtime query server (not part of Hadoop), load the "sparse" file into RAM and binary-search it: find the "bins" in which the range endpoints land, then open the non-sparse output on HDFS to finish the count.

Hope that helps.

Thanks
-Todd

On Tue, Jan 5, 2010 at 5:26 PM, Xueling Shu <x...@systemsbiology.org> wrote:

> To rephrase the sentence "Or what APIs I should start with for my testing?": I
> mean "Which HDFS APIs should I start looking into for my testing?"
>
> Thanks,
> Xueling
>
> On Tue, Jan 5, 2010 at 5:24 PM, Xueling Shu <x...@systemsbiology.org> wrote:
>
> > Hi Todd:
> >
> > After finishing some tasks I have finally gotten back to HDFS testing.
> >
> > One question about your last reply to this thread: are there any code
> > examples close to your second and third recommendations? Or which APIs
> > should I start with for my testing?
> >
> > Thanks.
> > Xueling
> >
> > On Sat, Dec 12, 2009 at 1:01 PM, Todd Lipcon <t...@cloudera.com> wrote:
> >
> >> Hi Xueling,
> >>
> >> In that case, I would recommend the following:
> >>
> >> 1) Put all of your data on HDFS.
> >> 2) Write a MapReduce job that sorts the data by position of match.
> >> 3) As a second output of this job, write a "sparse index" - basically
> >> a set of entries like this:
> >>
> >> <position of match> <offset into file> <number of entries following>
> >>
> >> where you're basically giving offsets into every 10K records or so. If
> >> you index every 10K records, then 5 billion total will mean 500,000
> >> index entries. Each index entry shouldn't be more than 20 bytes, so
> >> 500,000 entries will be about 10MB. This is super easy to fit into
> >> memory. (You could probably index every 100th record instead and end up
> >> with roughly 1GB, which still fits in memory on a well-provisioned box.)
> >>
> >> Then to satisfy your count-range query, you can simply scan your
> >> in-memory sparse index. Some of the indexed blocks will be completely
> >> included in the range, in which case you just add up the "number of
> >> entries following" column. The start and finish blocks will be
> >> partially covered, so you can use the file offset info to load that
> >> part of the file off HDFS, start reading at that offset, and finish
> >> the count.
> >>
> >> Total time per query should be <100ms, no problem.
> >>
> >> -Todd
> >>
> >> On Sat, Dec 12, 2009 at 10:38 AM, Xueling Shu <x...@systemsbiology.org> wrote:
> >> > Hi Todd:
> >> >
> >> > Thank you for your reply.
> >> >
> >> > The datasets won't be updated often, but queries against a data set
> >> > are frequent. The quicker the query, the better. For example, we have
> >> > done testing on a MySQL database (5 billion records randomly scattered
> >> > across 24 tables), and the slowest query against the biggest table
> >> > (400,000,000 records) is around 12 minutes.
> >> > So if any Hadoop product can speed up the search, then it is the
> >> > product we are looking for.
> >> >
> >> > Cheers,
> >> > Xueling
> >> >
> >> > On Fri, Dec 11, 2009 at 7:34 PM, Todd Lipcon <t...@cloudera.com> wrote:
> >> >
> >> >> Hi Xueling,
> >> >>
> >> >> One important question that can really change the answer:
> >> >>
> >> >> How often does the dataset change? Can the changes be merged in
> >> >> bulk every once in a while, or do you need to actually update them
> >> >> randomly very often?
> >> >>
> >> >> Also, how fast is "quick"? Do you mean 1 minute, 10 seconds, 1
> >> >> second, or 10ms?
> >> >>
> >> >> Thanks
> >> >> -Todd
> >> >>
> >> >> On Fri, Dec 11, 2009 at 7:19 PM, Xueling Shu <x...@systemsbiology.org> wrote:
> >> >> > Hi there:
> >> >> >
> >> >> > I am researching Hadoop to see which of its products suits our
> >> >> > need for quick queries against large data sets (billions of
> >> >> > records per set).
> >> >> >
> >> >> > The queries will be performed against chip sequencing data. Each
> >> >> > record is one line in a file. To be clear, a sample record from
> >> >> > the data set looks like:
> >> >> >
> >> >> > 1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 *103570835* F .. 23G 24C
> >> >> >
> >> >> > The highlighted field is called "position of match", and the query
> >> >> > we are interested in is the number of sequences in a certain range
> >> >> > of this "position of match". For instance, the range can be
> >> >> > "position of match" > 200 and "position of match" + 36 < 200,000.
> >> >> >
> >> >> > Any suggestions on the Hadoop product I should start with to
> >> >> > accomplish the task? HBase, Pig, Hive, or ...?
> >> >> >
> >> >> > Thanks!
> >> >> >
> >> >> > Xueling
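[Editor's note] Todd's first suggestion at the top of this thread - a Partitioner that splits the bounded "position of match" field into equal-width ranges so each reducer receives one contiguous slice - can be sketched in plain Python. This shows only the partitioning arithmetic, not a real org.apache.hadoop.mapreduce.Partitioner, and the human-genome-sized bound below is an illustrative assumption:

```python
def range_partition(position, max_position, num_reducers):
    """Map a position-of-match to a reducer index so each reducer
    receives one contiguous, equal-width slice of the position space
    (assumes positions are spread roughly uniformly)."""
    if not 0 <= position <= max_position:
        raise ValueError("position of match out of bounds")
    # ceiling-style bucket width so num_reducers buckets cover [0, max_position]
    width = max_position // num_reducers + 1
    return position // width

# illustrative bound: roughly the number of base pairs in the human genome
MAX_POS = 3_200_000_000
```

In the real job, a Java Partitioner's getPartition() would perform the same division, and the map step would emit the position-of-match as the key so the shuffle routes each record to the right range.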
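[Editor's note] The sparse-index scheme from Todd's December 12 reply can likewise be sketched. Everything lives in an in-memory Python list here purely for illustration - in the real design the partially covered end blocks would be read off HDFS starting at the stored offset - and all names are made up; a small `every` value stands in for the suggested 10K:

```python
def build_sparse_index(positions, every):
    """positions: all record positions, already sorted (the output of the
    MR sort job). Each index entry is (first position in block, offset of
    the block in the file, number of entries following), i.e. Todd's
    <position of match> <offset into file> <number of entries following>."""
    return [(positions[off], off, min(every, len(positions) - off))
            for off in range(0, len(positions), every)]

def count_in_range(positions, index, lo, hi):
    """Count records with lo <= position < hi. Fully covered blocks
    contribute their stored count without touching the data; only the
    partially covered blocks at the ends are scanned record by record
    (which would be an HDFS read at the stored offset in real life)."""
    total = 0
    for i, (first_pos, off, cnt) in enumerate(index):
        next_first = index[i + 1][0] if i + 1 < len(index) else None
        if next_first is not None and first_pos >= lo and next_first < hi:
            total += cnt                   # block entirely inside the range
        elif next_first is not None and (next_first < lo or first_pos >= hi):
            continue                       # block entirely outside the range
        else:
            total += sum(1 for p in positions[off:off + cnt] if lo <= p < hi)
    return total
```

Because the sorted data guarantees every record in a block falls between that block's first position and the next block's first position, a query only ever scans the two boundary blocks; the rest of the count comes straight out of the in-memory index.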