Jason Camp wrote:
Unfortunately in our scenario, bandwidth is cheap at our fetching
datacenter, but adding disk capacity is expensive - so we fetch the
data and ship it back to another cluster (by exporting segments from
ndfs, copying, and importing).
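
That export/copy/import round trip might look roughly like the following
(a sketch only - hostnames, paths, and the segment name are made up, and
the exact ndfs shell options may differ by Nutch version):

```
# 1. Export a fetched segment out of ndfs at the "fetch" datacenter
bin/nutch ndfs -get /crawl/segments/20060301000000 /tmp/20060301000000

# 2. Copy it over the wire to the "indexing" datacenter
scp -r /tmp/20060301000000 indexing-dc:/tmp/

# 3. Import it into the other cluster's ndfs
ssh indexing-dc \
  'bin/nutch ndfs -put /tmp/20060301000000 /crawl/segments/20060301000000'
```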

But to perform the copies, you're using a lot of bandwidth to your "indexing" datacenter, no? Copying segments probably takes almost as much bandwidth as fetching them...

    I know this sounds a bit messy, but it was the only way we could
come up with to utilize the benefits of both datacenters. Ideally, I'd
love to be able to have all of the servers in one cluster, and define
which servers I want to perform which tasks: for instance, one group of
servers could fetch the data, while the other group stores the data and
performs the indexing/etc. If there's a better way to do something like
this than what we're doing, or if you think we're just insane for doing
it this way, please let me know :) Thanks!

You can use different sets of machines for dfs and MapReduce by starting them from differently configured installations. So you could run dfs only in your "indexing" datacenter, and run MapReduce in both datacenters, configured to talk to the same dfs at the "indexing" datacenter. Then your fetch tasks at the "fetch" datacenter would write their output to the "indexing" datacenter's dfs, and parse/updatedb/generate/index/etc. could all run at the other datacenter. Does that make sense?
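
Concretely, that split might look something like this in each datacenter's
config (a sketch only - the hostnames are made up, and the property names
shown are the old-style fs.default.name / mapred.job.tracker ones, which
may differ in your Nutch version):

```
<!-- at BOTH datacenters: point at the one shared dfs namenode,
     which runs in the "indexing" datacenter -->
<property>
  <name>fs.default.name</name>
  <value>indexing-dc-namenode:9000</value>
</property>

<!-- at EACH datacenter: point at that datacenter's own jobtracker,
     so tasks are scheduled locally but all data lands in the
     shared dfs -->
<property>
  <name>mapred.job.tracker</name>
  <value>local-jobtracker:9001</value>
</property>
```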

Doug


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
