Aaron makes lots of sense when he says that there are better ways to do this lookup without making your mappers depend on each other.
But having a Hadoop cluster slam a MySQL farm with queries is asking for trouble (I have tried it). Hadoop mappers can saturate a MySQL database so *very* hard that it is a thing to behold.

There are lots of other options. The idea of using Zookeeper to spawn a special lookup thread on each machine isn't so bad, although I would avoid RMI like the plague, preferring Thrift or something similar. Having the program that launches the map-reduce job also launch a lookup cluster isn't a bad option either (but it isn't as simple as just starting the map-reduce program). Another option is to use a lookup system that depends on the file system cache for memory residency of the lookup table.

I would strongly recommend exploring a pure map-reduce solution to the problem. Try joining your lookup table to your map data in a preliminary map-reduce step. This is very easily done if you have a single lookup per map invocation. If you have a number of lookups, then make one pass through your data producing lookup keys, each with a pointer back to its original record key, and one pass through your lookup table producing key/value pairs. Reduce on the lookup key and emit the original record key plus the key/value pair from the lookup table, eliminating duplicate key/value pairs at this point. Reduce that against your original data and you have your original data with all of the lookup records the mapper needs in one place. You are now set to go with your original problem, except the lookup operation has been done ahead of time. (A rough sketch of the first pass in code follows the quoted message below.)

This sounds outrageously expensive, but because all of the disk I/O is sequential it can be surprisingly fast, even when the intermediate data sets are quite large.

On Thu, Mar 19, 2009 at 8:46 PM, Aaron Kimball <aa...@cloudera.com> wrote:

> Are you using multiple machines for your processing? Rolling your own RMI
> service to provide data to your other system seems like asking for tricky
> bugs. Why not just put the dictionary terms into a mysql database? Your
> mappers could then select against this database, pulling in data
> incrementally, and discarding data they don't need. If you configured
> memcached (like Jim suggests), then you can even get some memory-based
> performance boosts too by sharing common reads.

-- Ted Dunning, CTO DeepDyve
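For concreteness, here is a minimal sketch of the first of the two passes described above, written against the old org.apache.hadoop.mapred API that was current at the time. The tab-separated input layouts (recordKey/lookupKey for the data, lookupKey/value for the lookup table), the class names, and the REC/VAL tags are all illustrative assumptions, not anything specified in the thread. It tags both inputs by lookup key, reduces on that key, de-duplicates the looked-up values, and emits recordKey -> lookupKey=value pairs.

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class LookupJoinPass1 {

  // Pass over the original data (assumed layout: recordKey TAB lookupKey).
  // Emits (lookupKey, "REC" TAB recordKey) so the reducer can point back to the record.
  public static class DataMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      String[] fields = line.toString().split("\t", 2);
      if (fields.length == 2) {
        out.collect(new Text(fields[1]), new Text("REC\t" + fields[0]));
      }
    }
  }

  // Pass over the lookup table (assumed layout: lookupKey TAB value).
  // Emits (lookupKey, "VAL" TAB value).
  public static class LookupMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      String[] fields = line.toString().split("\t", 2);
      if (fields.length == 2) {
        out.collect(new Text(fields[0]), new Text("VAL\t" + fields[1]));
      }
    }
  }

  // Reduce on the lookup key: de-duplicate the looked-up values and emit
  // (recordKey, lookupKey=value) pairs, ready to be re-joined against the
  // original data on recordKey in a second pass.
  public static class JoinReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text lookupKey, Iterator<Text> values,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      Set<String> recordKeys = new HashSet<String>();
      Set<String> lookupValues = new HashSet<String>();
      while (values.hasNext()) {
        String[] tagged = values.next().toString().split("\t", 2);
        if ("REC".equals(tagged[0])) {
          recordKeys.add(tagged[1]);
        } else {
          lookupValues.add(tagged[1]);
        }
      }
      for (String recordKey : recordKeys) {
        for (String value : lookupValues) {
          out.collect(new Text(recordKey), new Text(lookupKey + "=" + value));
        }
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(LookupJoinPass1.class);
    conf.setJobName("lookup-join-pass-1");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setReducerClass(JoinReducer.class);
    conf.setOutputFormat(TextOutputFormat.class);
    MultipleInputs.addInputPath(conf, new Path(args[0]), TextInputFormat.class, DataMapper.class);
    MultipleInputs.addInputPath(conf, new Path(args[1]), TextInputFormat.class, LookupMapper.class);
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));
    JobClient.runJob(conf);
  }
}

The second pass is just another reduce-side join of this output against the original data, keyed on recordKey, after which every record carries the lookup results it needs. Note that the reducer above buffers the record keys and values for one lookup key in memory, which is fine as long as no single lookup key is referenced by an enormous number of records.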