Re: How to make data available in 10 minutes.

2009-07-22 Thread Ariel Rabkin
They're designed to take a few minutes, and they seem to in operation here and at Yahoo. Details, of course, will vary with data volumes and hardware. More benchmarks welcome. :) --Ari

Re: How to make data available in 10 minutes.

2009-07-20 Thread zsongbo
Hi Ari, Thanks. In Chukwa, how is the performance of the MapReduce merge jobs? The 1-hour merge and 1-day merge jobs would run simultaneously; how does that perform? Schubert

Re: How to make data available in 10 minutes.

2009-07-10 Thread Ariel Rabkin
Chukwa uses a MapReduce job for this, with a daemon process to identify the files to be merged. It's unfortunately not as generic as it could be; it assumes a lot about how the directory structure is laid out and how the files are named. I've been tempted to rewrite this to be more generic. But i
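A minimal sketch of what such a merge pass can look like, assuming SequenceFile input and the org.apache.hadoop.mapred API; the paths, key/value types, and reducer count here are illustrative, not Chukwa's actual code:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class MergeSmallFiles {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MergeSmallFiles.class);
    conf.setJobName("merge-10min-to-1hour");

    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // Identity map/reduce: the job exists only to re-partition many
    // small sorted files into a few large, sorted ones.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setNumReduceTasks(4); // one output file per reducer

    // Hypothetical layout: glob of 10-minute files for one hour.
    FileInputFormat.setInputPaths(conf, new Path("/data/10min/2009/07/06/14/*"));
    FileOutputFormat.setOutputPath(conf, new Path("/data/1hour/2009/07/06/14"));

    JobClient.runJob(conf);
  }
}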

Re: How to make data available in 10 minutes.

2009-07-10 Thread zsongbo
Ariel, Could you please give more detail on how Chukwa merges 10-min files -> 1-hour files -> 1-day files? 1. Does it run a background process/thread to do the merge periodically? How is the performance? 2. What about running a MapReduce job to do the merge periodically? How is the performance? Schubert
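For concreteness, option 1 could be a daemon loop like this hypothetical sketch, which periodically submits the merge job (MergeSmallFiles is the illustrative driver sketched above, not a Chukwa class):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class MergeDaemon {
  public static void main(String[] args) {
    ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(new Runnable() {
      public void run() {
        try {
          // Submit the hourly merge as a MapReduce job; the heavy
          // lifting happens on the cluster, not in this daemon.
          MergeSmallFiles.main(new String[0]);
        } catch (Exception e) {
          e.printStackTrace(); // keep the daemon alive on failure
        }
      }
    }, 0, 1, TimeUnit.HOURS);
  }
}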

Re: How to make data available in 10 minutes.

2009-07-09 Thread Ted Dunning
You are basically re-inventing lots of capabilities that others have solved before. The idea of building an index that refers to files constructed by progressive merging is very standard, and very similar to the way Lucene works. You don't say how much data you are moving, but I wou
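As a toy illustration of that Lucene-style progressive merging, assuming segments tracked only by size and an illustrative merge factor of 10 (Lucene's real merge policies are more sophisticated):

import java.util.ArrayList;
import java.util.List;

public class TieredMerge {
  static final int MERGE_FACTOR = 10;

  // Each entry is the size (e.g., record count) of one segment.
  // When MERGE_FACTOR segments of a tier accumulate, merge them
  // into a single segment of the next tier up.
  static List<Long> maybeMerge(List<Long> segments, long tierSize) {
    List<Long> small = new ArrayList<Long>();
    List<Long> rest = new ArrayList<Long>();
    for (Long s : segments) {
      if (s <= tierSize) small.add(s); else rest.add(s);
    }
    if (small.size() < MERGE_FACTOR) return segments; // nothing to do
    long merged = 0;
    for (Long s : small) merged += s; // one merge replaces the tier
    rest.add(merged);
    return rest;
  }
}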

Re: How to make data available in 10 minutes.

2009-07-09 Thread Ariel Rabkin
Chukwa does basically what you describe. We run a small job every 10 minutes, and then merge the results periodically. But we don't do indexing. --Ari
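One hypothetical way to express the time-bucketed layout such a pipeline depends on, with made-up /data paths and 10-minute bucketing:

import java.text.SimpleDateFormat;
import java.util.Date;

public class BucketPaths {
  public static void main(String[] args) {
    Date now = new Date();
    SimpleDateFormat day  = new SimpleDateFormat("yyyy/MM/dd");
    SimpleDateFormat hour = new SimpleDateFormat("yyyy/MM/dd/HH");
    SimpleDateFormat min  = new SimpleDateFormat("yyyy/MM/dd/HH/mm");

    // Round minutes down to the 10-minute bucket: 14:37 -> .../14/30
    long tenMin = 10L * 60 * 1000;
    Date bucket = new Date((now.getTime() / tenMin) * tenMin);

    System.out.println("/data/10min/" + min.format(bucket));
    System.out.println("/data/1hour/" + hour.format(now));
    System.out.println("/data/1day/"  + day.format(now));
  }
}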

How to make data available in 10 minutes.

2009-07-06 Thread zsongbo
Hi all, We have built a system which uses Hadoop MapReduce to sort and index the input files. The index is a straightforward blocked-key => files+offsets mapping. Then we can query the dataset with low latency. Usually, we run the MapReduce jobs periodically, once a day or every few hours. Then the data before one day
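A minimal sketch of the query side under that design, with a hypothetical in-memory index and fixed-length blocks; index loading and record decoding are elided:

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IndexedLookup {
  static class Entry {
    final String file; final long offset;
    Entry(String file, long offset) { this.file = file; this.offset = offset; }
  }

  // blocked key (e.g., a key prefix) -> where its block starts
  static final Map<String, Entry> index = new HashMap<String, Entry>();

  static byte[] lookup(String blockedKey, int blockLen) throws Exception {
    Entry e = index.get(blockedKey);
    if (e == null) return null; // key not indexed
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataInputStream in = fs.open(new Path(e.file));
    try {
      in.seek(e.offset);  // jump straight to the block
      byte[] buf = new byte[blockLen];
      in.readFully(buf);  // low latency: one seek plus one read
      return buf;
    } finally {
      in.close();
    }
  }
}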