Thanks, Ken, for the reply.
Well, that's what I am doing right now. But the output from the mappers
needs to be processed together. Due to the nature of the problem,
sorting is trivial once the map output becomes available. That's why I
don't want to spend time in Hadoop's built-in sort, which involves disk
I/O.
If you don't need sorted input, then you probably don't even need a
reducer. Try putting all your functionality in the mapper and then set
reduce tasks to zero.
On Thursday, July 29, 2010, juber patel wrote:
> Hi,
>
> Is it possible to use Hadoop and not use disk I/O, apart from the
> initial input?
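A minimal driver sketch of that suggestion, assuming the Hadoop 2.x mapreduce API (MyMapper is an illustrative name for the class holding the map-side logic):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "map-only");
            job.setJarByClass(MapOnlyDriver.class);
            job.setMapperClass(MyMapper.class);   // all the work happens here
            job.setNumReduceTasks(0);             // zero reducers: no sort/shuffle,
                                                  // map output goes straight to HDFS
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With zero reduce tasks, each mapper's output is written directly to the output directory, skipping the sort and shuffle entirely.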
Hi,
Is it possible to use Hadoop and not use disk I/O, apart from the
initial input?
I am asking this on the assumption that disk I/O is the bottleneck in
overall processing, even more so than network access if you are on a
dedicated, high-speed cluster. (Does anyone have experience to
confirm this?)
Very well. Could you keep us informed on how your instant merging plans
work out? We're actually running a similar indexing process.
It's very interesting to be able to start merging Lucene indexes once
the first mappers have finished, instead of waiting until ALL mappers
have finished.
Shai
Well, the current scale does not warrant sharding the index. 13 GB of
data and an ~8-10 GB index are something a single machine (even a modest
one) can handle pretty well. And as the data grows, so will the number of
Mappers, and with them the number of sub-indexes. So at some point I will
need to merge indexes…
You're right that merging cannot be done by simply appending. Have you
thought about the possibility of actually taking advantage of the fact
that your final index will be split into several segments? Especially if
you plan to increase the scale of the input data, you may eventually
want…
Specifically, at the moment a Mapper's output is an index (Lucene), and the
Reducer's job is to take all the indexes and merge them down to one. Searching
an index w/ hundreds of segments is inefficient. Basically, if you think of
the Reducer as holding the 'final' index (to be passed on in the larger
pipeline)…
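As a sketch, that reduce-side merge maps onto Lucene's addIndexes (this assumes a recent Lucene API; the 3.x line current at the time used addIndexesNoOptimize and optimize instead):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MergeSubIndexes {
        // args[0]: target index directory; args[1..]: the per-Mapper sub-indexes
        public static void main(String[] args) throws Exception {
            try (Directory target = FSDirectory.open(Paths.get(args[0]));
                 IndexWriter writer = new IndexWriter(target,
                         new IndexWriterConfig(new StandardAnalyzer()))) {
                Directory[] subs = new Directory[args.length - 1];
                for (int i = 1; i < args.length; i++) {
                    subs[i - 1] = FSDirectory.open(Paths.get(args[i]));
                }
                writer.addIndexes(subs);  // copies the sub-indexes' segments in
                writer.forceMerge(1);     // optional: collapse to a single segment
            }
        }
    }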
Could you elaborate on how you merge your data? If you have independent
map tasks with single key-value outputs that can be written as soon as
possible, I'm still curious why you need to reduce at all. Surely there must
be a way to merge your data 'on read'.
About multi-level reducing: if you merge…
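On 'merge on read': Lucene can already present several physical indexes as one logical index at search time, e.g. (a sketch, assuming a recent Lucene API):

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class SearchWithoutMerging {
        // Open one searcher over many sub-indexes; no physical merge required.
        public static IndexSearcher open(String... indexPaths) throws Exception {
            IndexReader[] subs = new IndexReader[indexPaths.length];
            for (int i = 0; i < indexPaths.length; i++) {
                subs[i] = DirectoryReader.open(FSDirectory.open(Paths.get(indexPaths[i])));
            }
            // MultiReader presents the sub-indexes as a single logical index
            return new IndexSearcher(new MultiReader(subs));
        }
    }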
Yes, I'm aware of the fact that I can run w/ 0 reducers. However, I do need
my output to be merged, and unfortunately the merge is not so simple that I
can use 'getmerge'. I think I'll give the Combiner a chance now; hopefully
its reduce() method will get called as Mappers finish their work.
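One caution on that plan: Hadoop treats the Combiner as an optional map-side optimization and may run it zero, one, or several times per mapper, so the merge logic has to tolerate repeated application. Wiring it in is a sketch like the following (MyMerger is an illustrative Reducer subclass):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CombinerWiring {
        public static Job configure() throws Exception {
            Job job = Job.getInstance(new Configuration(), "merge-with-combiner");
            // The same merge class can often serve both roles, but only if
            // its logic is safe to apply more than once per key.
            job.setCombinerClass(MyMerger.class);
            job.setReducerClass(MyMerger.class);
            return job;
        }
    }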
Just a quick pointer here:
You are aware of the fact that you can configure the number of reduce
tasks to be zero? If I read correctly, you mention that the order of your
map outputs in the merge does not really matter, and that there is a single
value for every key. Perhaps you could eliminate your reducers altogether.