Jason Camp wrote:
Unfortunately in our scenario, bandwidth is cheap at our fetching
datacenter, but adding disk capacity is expensive - so we fetch the
data and ship it back to another cluster (by exporting segments from
ndfs, copying, and importing).
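
That export/copy/import round trip might look roughly like the following
(a sketch only - hostnames, paths, and the segment name are made up, and
the exact ndfs shell options may differ by Nutch version):

```
# 1. Export a fetched segment out of ndfs at the "fetch" datacenter
bin/nutch ndfs -get /crawl/segments/20060301000000 /tmp/20060301000000

# 2. Copy it over the wire to the "indexing" datacenter
scp -r /tmp/20060301000000 indexing-dc:/tmp/

# 3. Import it into the other cluster's ndfs
ssh indexing-dc \
  'bin/nutch ndfs -put /tmp/20060301000000 /crawl/segments/20060301000000'
```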

But to perform the copies, you're using a lot of bandwidth to your "indexing" datacenter, no? Copying segments probably takes almost as much bandwidth as fetching them...

    I know this sounds a bit messy, but it was the only way we could
come up with to utilize the benefits of both datacenters. Ideally, I'd
love to be able to have all of the servers in one cluster, and define
which servers I want to perform which tasks: for instance, one group of
servers could fetch the data, while the other group stores the data and
performs the indexing/etc. If there's a better way to do something like
this than what we're doing, or if you think we're just insane for doing
it this way, please let me know :) Thanks!

You can use different sets of machines for dfs and MapReduce by starting them from differently configured installations. So you could run dfs only in your "indexing" datacenter, and run MapReduce in both datacenters, configured to talk to the same dfs at the "indexing" datacenter. Then your fetch tasks at the "fetch" datacenter would write their output to the "indexing" datacenter's dfs, and parse/updatedb/generate/index/etc. could all run at the other datacenter. Does that make sense?
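
Concretely, that split might look something like this in each datacenter's
config (a sketch only - the hostnames are made up, and the property names
shown are the old-style fs.default.name / mapred.job.tracker ones, which
may differ in your Nutch version):

```
<!-- at BOTH datacenters: point at the one shared dfs namenode,
     which runs in the "indexing" datacenter -->
<property>
  <name>fs.default.name</name>
  <value>indexing-dc-namenode:9000</value>
</property>

<!-- at EACH datacenter: point at that datacenter's own jobtracker,
     so tasks are scheduled locally but all data lands in the
     shared dfs -->
<property>
  <name>mapred.job.tracker</name>
  <value>local-jobtracker:9001</value>
</property>
```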

Doug


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
