Hello,
Thank you for your efforts. We've also been trying some things, and we have
currently implemented a solution that we are satisfied with. It allows us to
instantly start merging the output of any indexing job. It's a single class
that wraps an indexing Job and submits it to the JobTracker asynchronously.
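A minimal sketch of what such a wrapper might look like, assuming the old
org.apache.hadoop.mapred API that the JobTracker reference implies; the class
name AsyncIndexingJob and its methods are illustrative, not the poster's actual
code. The key point is that JobClient.submitJob() returns a RunningJob handle
immediately, unlike the blocking JobClient.runJob().

// Hypothetical wrapper: submit an indexing job without blocking, then poll it.
import java.io.IOException;

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class AsyncIndexingJob {
    private final JobConf conf;
    private RunningJob running;

    public AsyncIndexingJob(JobConf conf) {
        this.conf = conf;
    }

    /** Submit the job to the JobTracker and return immediately. */
    public RunningJob submit() throws IOException {
        JobClient client = new JobClient(conf);
        running = client.submitJob(conf);   // non-blocking, unlike JobClient.runJob()
        return running;
    }

    /** Poll for completion, e.g. from a merge thread that picks up finished indexes. */
    public boolean isDone() throws IOException {
        return running != null && running.isComplete();
    }
}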
Hi
I've done some work and thought I'd report back the results (that are
not too encouraging).
Approach 1:
* Mappers output a side-effect Lucene Directory (written on disk) and
a key/value pair where the value is the location of the index on
disk and the key is unimportant for now (a rough sketch of such a
mapper follows below).
* Reducer merges the on-disk indexes
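A rough sketch of the mapper half of Approach 1, assuming Lucene 3.0-style
constructors, the old mapred API, and TextInputFormat-style input. All class
and field names here are illustrative, and the question of where the
side-effect index must live so that the Reducer can actually reach it is
glossed over, which is part of what makes this approach tricky.

// Illustrative sketch (not the original code): each map task builds a Lucene
// index as a side-effect directory and emits its location once in close().
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, NullWritable, Text> {

    private IndexWriter writer;
    private File indexDir;
    private OutputCollector<NullWritable, Text> out;

    public void configure(JobConf conf) {
        try {
            // Local side-effect directory, one per task attempt. In a real job this
            // location would have to be reachable by the Reducer (e.g. copied to
            // HDFS or shared storage), which is glossed over in this sketch.
            indexDir = new File(System.getProperty("java.io.tmpdir"),
                    "index-" + conf.get("mapred.task.id"));
            // Constructor signatures differ across Lucene versions; this matches ~3.0.
            writer = new IndexWriter(FSDirectory.open(indexDir),
                    new StandardAnalyzer(Version.LUCENE_30), true,
                    IndexWriter.MaxFieldLength.UNLIMITED);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<NullWritable, Text> output, Reporter reporter)
            throws IOException {
        out = output;  // remember the collector so close() can emit the location once
        Document doc = new Document();
        doc.add(new Field("body", value.toString(), Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }

    public void close() throws IOException {
        writer.close();
        if (out != null) {
            // The key is unimportant; the value is where the sub-index lives on disk.
            out.collect(NullWritable.get(), new Text(indexDir.getAbsolutePath()));
        }
    }
}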
Very well. Could you keep us informed on how your instant merging plans
work out? We're actually running a similar indexing process.
It's very interesting to be able to start merging Lucene indexes once
the first mappers have finished, instead of waiting until ALL mappers
have finished.
Shai
Well, the current scale does not warrant sharding up the index. 13GB of
data, ~8-10GB index is something a single machine (even not a strong one)
can handle pretty well. And as the data will grow, so will the number of
Mappers, and the number of sub-indexes. So at some point I will need to
merge in
You're right that the merging cannot be done by simply appending. Have
you thought about the possibility of actually taking advantage of the
fact that your final index will be split into several segments?
Especially if you plan to increase the scale of the input data, you may
eventually want
Specifically, at the moment a Mapper's output is a (Lucene) index, and the
Reducer's job is to take all the indexes and merge them down to one. Searching
an index w/ hundreds of segments is inefficient. Basically, if you think of
the Reducer as holding the 'final' index (to be passed on in the larger
p
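For readers following the thread, the "merge them down to one" step usually
reduces to IndexWriter's add-indexes call plus an optimize, roughly as below.
This is a hedged sketch, not the poster's code; the method is
addIndexesNoOptimize(Directory[]) on the Lucene 2.x/3.0 line and
addIndexes(Directory...) on newer releases.

// Sketch: merge several on-disk sub-indexes into one target index.
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexMerger {

    /** Merge every sub-index directory into a single target index. */
    public static void merge(File targetDir, List<File> subIndexDirs) throws IOException {
        IndexWriter writer = new IndexWriter(FSDirectory.open(targetDir),
                new StandardAnalyzer(Version.LUCENE_30), true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        try {
            Directory[] dirs = new Directory[subIndexDirs.size()];
            for (int i = 0; i < dirs.length; i++) {
                dirs[i] = FSDirectory.open(subIndexDirs.get(i));
            }
            writer.addIndexesNoOptimize(dirs); // addIndexes(dirs) on newer Lucene
            writer.optimize();                 // collapse the hundreds of segments
        } finally {
            writer.close();
        }
    }
}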
Could you elaborate on how you merge your data? If you have independent
map tasks with single key-value outputs that can be written as soon as
possible, I'm still curious why you need to reduce at all. Surely there must
be a way to merge your data 'on read'.
About multi-level reducing; if you merg
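On the "merge on read" idea: Lucene's MultiReader can present several
sub-indexes as one logical index without a physical merge. A minimal sketch,
assuming a Lucene 3.x-style API (IndexReader.open moved to DirectoryReader.open
in later versions); the class name is illustrative.

// Sketch: search sub-indexes "on read" with MultiReader instead of merging.
import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class OnReadMerge {
    public static IndexSearcher openSearcher(File[] subIndexDirs) throws IOException {
        IndexReader[] readers = new IndexReader[subIndexDirs.length];
        for (int i = 0; i < subIndexDirs.length; i++) {
            readers[i] = IndexReader.open(FSDirectory.open(subIndexDirs[i]));
        }
        // MultiReader exposes all sub-indexes as a single logical index.
        return new IndexSearcher(new MultiReader(readers));
    }
}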
Yes, I'm aware of the fact that I can run w/ 0 reducers. However, I do need
my output to be merged, and unfortunately the merge is not simple enough
that I can just use 'getmerge'. I think I'll give the Combiner a chance
now - hopefully its reduce() method will get called as Mappers finish their
work.
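For reference, wiring a Combiner in with the old API is a single call on the
JobConf, shown below with a placeholder reducer class. Note that the framework
may run the combiner zero or more times per map task, and only over values
that share a key, so it is not a guaranteed early reduce.

// Sketch: setting a combiner on the JobConf (MyMergeReducer is a placeholder).
import org.apache.hadoop.mapred.JobConf;

public class CombinerConfig {
    public static void useCombiner(JobConf conf) {
        // Runs map-side, possibly zero times, and only combines same-key values.
        conf.setCombinerClass(MyMergeReducer.class);
    }
}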
Just a quick pointer here:
Are you aware that you can configure the number of reduce tasks to be
zero? If I read correctly, you mention that the order of your map
outputs in the merge does not really matter, and that there is a single
value for every key. Perhaps you could eliminate your reduce step
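Concretely, the zero-reducer suggestion is one setting on the JobConf; with no
reduce tasks there is no shuffle at all, and each mapper's output is written
straight to the job output directory as that mapper finishes. Sketch:

// Sketch: turn a job into a map-only job.
import org.apache.hadoop.mapred.JobConf;

public class MapOnlyConfig {
    public static void makeMapOnly(JobConf conf) {
        conf.setNumReduceTasks(0);   // no shuffle, no reduce phase at all
    }
}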
Without going too much into detail (so that we keep the discussion
simple) -- I'm processing large volumes of text documents. Currently I'm
training (myself :)) on a 13GB collection, but after I figure out the recipe
for writing the proper mix of Mappers/Reducers/Jobs, I will move to a TB
collection.
Shai,
It's hard to determine what the best solution would be without knowing more
about your problem. In general, combiner functions work well but they will be
of little value if each mapper output contains a unique key. This is because
combiner functions only "combine" multiple values associated with the same key.
Thanks for the prompt response, Amogh!
I'm kind of a rookie w/ Hadoop, so please forgive my perhaps "too rookie"
questions :).
> Check the property mapred.reduce.slowstart.completed.maps
From what I read here
(http://hadoop.apache.org/common/docs/current/mapred-default.html), this
parameter controls when the reduce tasks are started.
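For the archive: mapred.reduce.slowstart.completed.maps is the fraction of map
tasks that must complete before reduce tasks are scheduled (default 0.05).
Lowering it only launches reducers earlier so they can start copying map
outputs; reduce() itself still runs only after all maps have finished. A small
example of setting it per job:

// Sketch: schedule reducers after only 5% of the maps have completed.
import org.apache.hadoop.mapred.JobConf;

public class SlowstartExample {
    public static void configure(JobConf conf) {
        // Fraction of map tasks that must complete before reducers are scheduled.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.05f);
    }
}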
Hi,
>>What would really be great for me is if I could have the Reducer start
>>processing the map outputs as they are ready, and not after all Mappers finish
Check the property mapred.reduce.slowstart.completed.maps
>>I've read about chaining mappers, but to the best of my understanding the
>>se
Hi
I have a scenario for which I'd like to write a MR job in which Mappers do
some work and eventually the output of all Mappers needs to be combined by a
single Reducer. Each Mapper outputs a key that is distinct from all
other Mappers', meaning the Reducer.reduce() method always receives a single
element
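A bare-bones driver for the scenario described above (many Mappers, a single
Reducer that sees every map output) might look like the following with the old
API; MyMapper and MyMergeReducer are placeholders, not classes from the thread.

// Sketch: many map tasks funnelled through exactly one reducer.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SingleReducerDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SingleReducerDriver.class);
        conf.setJobName("merge-all-map-outputs");
        conf.setMapperClass(MyMapper.class);        // placeholder mapper
        conf.setReducerClass(MyMergeReducer.class); // placeholder reducer
        conf.setNumReduceTasks(1);                  // one reducer sees every key
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}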