Which Hadoop product is more appropriate for a quick query on a large data set?

2009-12-11 Thread Xueling Shu
Hi there: I am researching Hadoop to see which of its products suits our need for quick queries against large data sets (billions of records per set). The queries will be performed against chip sequencing data. Each record is one line in a file. To be clear below shows a sample record in the data

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

2009-12-11 Thread Todd Lipcon
Hi Xueling, One important question that can really change the answer: How often does the dataset change? Can the changes be merged in in bulk every once in a while, or do you need to actually update them randomly very often? Also, how fast is "quick"? Do you mean 1 minute, 10 seconds, 1 second,

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

2009-12-12 Thread Xueling Shu
Hi Todd: Thank you for your reply. The datasets won't be updated often, but queries against a data set are frequent. The quicker the query, the better. For example we have done testing on a MySQL database (5 billion records randomly scattered into 24 tables) and the slowest query against the bigg

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

2009-12-12 Thread Todd Lipcon
Hi Xueling, In that case, I would recommend the following: 1) Put all of your data on HDFS 2) Write a MapReduce job that sorts the data by position of match 3) As a second output of this job, you can write a "sparse index" - basically a set of entries like this: where you're basically giving
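To make the "sparse index" idea concrete, here is a minimal sketch in plain Python (not the Hadoop API). The entry format is cut off in the message above, so the layout used here, one (position, record number) pair for every N-th record of the sorted data, is an assumption for illustration only. The names build_sparse_index and lookup are hypothetical.

```python
import bisect

def build_sparse_index(sorted_positions, every=1000):
    """Keep one (position, record_number) entry per `every` records.

    Assumes `sorted_positions` is already sorted ascending, as it would be
    after the sort job in step 2. The entry layout is an assumption.
    """
    return [(sorted_positions[i], i)
            for i in range(0, len(sorted_positions), every)]

def lookup(sorted_positions, index, target):
    """Return the record numbers whose position equals `target`.

    The sparse index narrows the search to a small window; only that
    window is scanned instead of the whole data set.
    """
    keys = [pos for pos, _ in index]
    # Start from the last index entry strictly below the target so that
    # duplicates spanning an index boundary are not missed.
    slot = max(bisect.bisect_left(keys, target) - 1, 0)
    start = index[slot][1]
    hits = []
    for i in range(start, len(sorted_positions)):
        pos = sorted_positions[i]
        if pos > target:
            break  # data is sorted, nothing further can match
        if pos == target:
            hits.append(i)
    return hits

positions = [1, 3, 3, 3, 7, 9, 12, 12, 15]
index = build_sparse_index(positions, every=2)
print(lookup(positions, index, 12))
```

In a real deployment the record number would more likely be a byte offset into the sorted file on HDFS, so a reader could seek directly to the window; that detail is also an assumption, not something stated in the thread.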

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

2009-12-12 Thread stack
You might also consider hbase, particularly if you find that your data is being updated with some regularity, and especially if the updates are randomly distributed over the data set. See http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk for h

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

2009-12-12 Thread Xueling Shu
Great information! Thank you for your help, Todd. Xueling On Sat, Dec 12, 2009 at 1:01 PM, Todd Lipcon wrote: > Hi Xueling, > > In that case, I would recommend the following: > > 1) Put all of your data on HDFS > 2) Write a MapReduce job that sorts the data by position of match > 3) As a second

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

2009-12-12 Thread Fred Zappert
+1 for hbase On Sat, Dec 12, 2009 at 2:56 PM, Xueling Shu wrote: > Great information! Thank you for your help, Todd. > > Xueling > > On Sat, Dec 12, 2009 at 1:01 PM, Todd Lipcon wrote: > > > Hi Xueling, > > > > In that case, I would recommend the following: > > > > 1) Put all of your data on HDF

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

2010-01-05 Thread Xueling Shu
Hi Todd: After finishing some tasks I finally get back to HDFS testing. One question for your last reply to this thread: Are there any code examples close to your second and third recommendations? Or what APIs I should start with for my testing? Thanks. Xueling On Sat, Dec 12, 2009 at 1:01 PM,

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

2010-01-05 Thread Xueling Shu
Rephrase the sentence "Or what APIs I should start with for my testing?": I mean "What HDFS APIs should I start to look into for my testing?" Thanks, Xueling On Tue, Jan 5, 2010 at 5:24 PM, Xueling Shu wrote: > Hi Todd: > > After finishing some tasks I finally get back to HDFS testing. > > One q

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

2010-01-06 Thread Todd Lipcon
Hi Xueling, Here's a general outline: My guess is that your "position of match" field is bounded (perhaps by the number of base pairs in the human genome?) Given this, you can probably write a very simple Partitioner implementation that divides this field into ranges, each with an approximately e
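The range-partitioning idea can be sketched in plain Python as follows. This is not Hadoop's Partitioner interface, only the arithmetic such an implementation might use. It assumes positions are integers in 0..max_position and uses equal-width ranges, which is a simplification chosen here; the function name range_partition and the example bound are hypothetical.

```python
def range_partition(position, max_position, num_partitions):
    """Map a position to a partition number in 0..num_partitions-1.

    Assumes 0 <= position <= max_position. Ranges are equal-width, so
    partition sizes are only balanced if positions are roughly uniform.
    """
    if not 0 <= position <= max_position:
        raise ValueError("position out of range")
    # Ceiling division so the last range still covers max_position.
    width = (max_position + num_partitions) // num_partitions
    return position // width

# Example with a made-up bound of 99 and 4 partitions:
for p in (0, 24, 25, 99):
    print(p, range_partition(p, 99, 4))
```

Because the partition number never decreases as the position grows, each partition holds one contiguous range of positions. If every partition is sorted internally, reading the partitions in order yields a globally sorted data set.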

Re: Which Hadoop product is more appropriate for a quick query on a large data set?

2010-01-06 Thread Xueling Shu
Thanks for the information! I will start to try. Xueling On Wed, Jan 6, 2010 at 11:32 AM, Todd Lipcon wrote: > Hi Xueling, > > Here's a general outline: > > My guess is that your "position of match" field is bounded (perhaps by the > number of base pairs in the human genome?) Given this, you ca

RE: Which Hadoop product is more appropriate for a quick query on a large data set?

2010-01-06 Thread Gibbon, Robert, VF-Group
BQL sql-dialect? Sorry, my 10 pence worth! -Original Message- From: Xueling Shu [mailto:x...@systemsbiology.org] Sent: Wed 1/6/2010 8:41 PM To: general@hadoop.apache.org Subject: Re: Which Hadoop product is more appropriate for a quick query on a large data set? Thanks for the information!