They're designed to take a few minutes, and seem to do so in operations here
and at Yahoo. Details, of course, will vary depending on data volumes
and hardware. More benchmarks welcome. :)
--Ari
On Mon, Jul 20, 2009 at 3:04 AM, zsongbo wrote:
Hi Ari,
Thanks.
In Chukwa, how about the performance of the MapReduce merge jobs?
The 1-hour merge and 1-day merge MapReduce jobs would run simultaneously;
how is the performance then?
Schubert
On Sat, Jul 11, 2009 at 7:46 AM, Ariel Rabkin wrote:
Chukwa uses a mapreduce job for this, with a daemon process to
identify the files to be merged. It's unfortunately not as generic as
it could be; it assumes a lot about the way the directory structure is
laid out and the files are named.
I've been tempted to rewrite this to be more generic.
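For anyone curious how such a daemon might pick merge groups from filenames
alone, here is a minimal sketch. The filename convention, the ".evt" suffix,
and the helper names are invented for illustration; Chukwa's real directory
layout and naming are different, which is exactly the genericity problem
described above.

```python
import re
from collections import defaultdict

# Hypothetical filename convention: <source>_<YYYYMMDDHHMM>.evt
# (Chukwa's real layout differs; this only illustrates a daemon that
# derives merge groups from names alone.)
NAME_RE = re.compile(r"^(?P<src>\w+)_(?P<ts>\d{12})\.evt$")

def hourly_merge_groups(filenames):
    """Group 10-minute files into per-hour buckets keyed by
    (source, YYYYMMDDHH); each bucket is one input set for a merge job."""
    groups = defaultdict(list)
    for name in filenames:
        m = NAME_RE.match(name)
        if not m:
            continue  # skip files that don't follow the convention
        hour_key = (m.group("src"), m.group("ts")[:10])  # drop the minutes
        groups[hour_key].append(name)
    return {k: sorted(v) for k, v in groups.items()}
```

Each returned bucket would become the input set for one hourly merge job;
a second pass grouping hourly files by day would give the 1-hour -> 1-day
step.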
Ariel,
Could you please give more detail about how Chukwa merges 10-min files -> 1-hour
files -> 1-day files?
1. Does it run a background process/thread to do the merge periodically? How
is the performance?
2. What about running a MapReduce job to do the merge periodically? How is
the performance?
Schubert
You are basically re-inventing lots of capabilities that others have solved
before.
The idea of building an index that refers to files which are constructed by
progressive merging is very standard and very similar to the way that Lucene
works.
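To make that Lucene-like progressive merging concrete, here is a toy sketch.
The class, the FANOUT value, and the in-memory runs are all assumptions for
illustration, not anyone's actual implementation:

```python
import heapq
from collections import defaultdict

def merge_runs(runs):
    """Merge several sorted runs (lists of (key, value) pairs) into one
    sorted run -- the core step of progressive merging."""
    return list(heapq.merge(*runs))

class ProgressiveStore:
    """Toy log-structured store: new batches become small sorted runs, and
    whenever FANOUT runs of one level accumulate they are merged into a
    single run one level up (10-min -> 1-hour -> 1-day, in spirit)."""
    FANOUT = 3  # merge every 3 runs; real systems tune this per workload

    def __init__(self):
        self.levels = defaultdict(list)  # level -> list of sorted runs

    def add_batch(self, pairs):
        self.levels[0].append(sorted(pairs))
        self._compact()

    def _compact(self):
        level = 0
        while len(self.levels[level]) >= self.FANOUT:
            merged = merge_runs(self.levels[level])
            self.levels[level] = []
            self.levels[level + 1].append(merged)
            level += 1
```

Lucene does essentially this with index segments; an index over the merged
files then only needs to track a small number of large, sorted runs.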
You don't say how much data you are moving.
Chukwa does basically what you describe.
We run a small job every 10 minutes, and then merge the results
periodically. But we don't do indexing.
--Ari
On Mon, Jul 6, 2009 at 7:03 PM, zsongbo wrote:
Hi all,
We have built a system which uses Hadoop MapReduce to sort and index the
input files. The index is a straightforward blocked-key => files+offsets mapping.
Then we can query the dataset with low latency.
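A rough sketch of how such a blocked-key => files+offsets index can serve
low-latency lookups. The block size, the file name, and the in-memory blocks
here are simplifications invented for illustration; the real system
presumably seeks into HDFS files at the recorded offsets:

```python
import bisect

def build_block_index(sorted_records, block_size=4):
    """Build a sparse blocked-key index over records already sorted by
    key: one entry per block, pointing at a (file, offset) location."""
    index_keys, index_locs, blocks = [], [], []
    for i in range(0, len(sorted_records), block_size):
        block = sorted_records[i:i + block_size]
        index_keys.append(block[0][0])        # first key in the block
        index_locs.append(("part-00000", i))  # hypothetical file + offset
        blocks.append(block)
    return index_keys, index_locs, blocks

def lookup(key, index_keys, blocks):
    """Binary-search the sparse index for the block that may hold `key`,
    then scan only that block -- this is what keeps query latency low."""
    pos = bisect.bisect_right(index_keys, key) - 1
    if pos < 0:
        return None
    for k, v in blocks[pos]:
        if k == key:
            return v
    return None
```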
Usually, we run the MapReduce jobs periodically, every day or every few hours. Then
the data before one day