I am currently writing a Python script to automate this whole process, from inject all the way to pushing out to the search servers. It should be done in a day or two and I will post it on the wiki.
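In rough outline it just chains the standard bin/nutch commands, one fetch/index cycle per search server. Here is a stripped-down sketch of the idea, not the real script: the paths, topN value, server list and the /search/ target are placeholders, there is no error handling, and for brevity it assumes local directories (against the dfs you would shell out to bin/hadoop dfs -copyToLocal before pushing, and rsync/scp from there).

#!/usr/bin/env python
# Sketch of the per-search-server crawl cycle (steps 3-8 in the mail below).

import os
import subprocess

NUTCH    = "bin/nutch"
CRAWLDB  = "crawl/crawldb"
LINKDB   = "crawl/linkdb"
SEGMENTS = "crawl/segments"
TOPN     = "2000000"
SERVERS  = ["search1", "search2"]

def nutch(*args):
    # run one nutch command, stop the whole run if it fails
    subprocess.check_call([NUTCH] + list(args))

def newest_segment():
    # segment directories are timestamped, so the newest one sorts last
    return os.path.join(SEGMENTS, sorted(os.listdir(SEGMENTS))[-1])

for server in SERVERS:
    # step 3: generate the next best TOPN urls from the master crawldb
    nutch("generate", CRAWLDB, SEGMENTS, "-topN", TOPN)
    segment = newest_segment()
    # step 4: fetch the new segment
    nutch("fetch", segment)
    # step 5: update the single master crawldb
    nutch("updatedb", CRAWLDB, segment)
    # step 6 (simplified: later cycles can invert into a temp linkdb and
    # mergelinkdb it into the master, as described below)
    nutch("invertlinks", LINKDB, segment)
    # step 7: index just this fetched segment
    nutch("index", "indexes-" + server, CRAWLDB, LINKDB, segment)
    # step 8: push the index, the segment and (for now) the whole linkdb
    subprocess.check_call(["rsync", "-a", "indexes-" + server, segment,
                           LINKDB, server + ":/search/"])

That is only the core loop; the dfs copying and the per-segment linkdb splitting described below still have to be wired in.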
Dennis Kubes

charlie w wrote:
> Thanks very much for the extended reply; lots of food for thought.
>
> WRT the merge/index time on a large index, I kind of suspected this might be
> the case. It's already taking a bit of time (albeit on a weak box) with my
> relatively small index. In general the approach you outline sounds like
> something I intuitively thought might need to be done, but had no
> real experience to justify that intuition.
>
> So if I understand you correctly, each iteration of fetching winds up on a
> separate search server, and you're not doing any merging of segments?
>
> When you eventually get around to recrawling a particular page, do you wind
> up with problems if that page exists in two separate indexes on two separate
> search servers? For example, we fetch www.foo.com, and that page goes into
> the index on search server 1. Then, 35 days later, we go back to crawl
> www.foo.com, and this time it winds up in the index on search server 2.
> Wouldn't the two search servers return the same page as a hit to a search?
> If not, what prevents that from being an issue?

You can do a dedup of the results on the search itself (see the sketch at the
bottom of this message). So yes, there are duplicates in the different index
segments, but you will always be returning the "best" pages to the user.

> It also seems that I must be missing something regarding new pages. If, as
> in step 9, you are replacing the index on a search server, wouldn't you
> possibly create the effect of removing documents from the index? Say you
> have the same 2 search servers, but do 10 iterations of fetching as a
> "depth" of crawl. Wouldn't you be replacing the documents in search server
> 1 several times over the course of those 10 iterations?

No, because you are updating a single master crawldb, and on the next
iteration it wouldn't grab the same pages; it would grab the next best n pages.

> Once again, thanks.
>
> - Charlie
>
>
> On 7/31/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>> It is not a problem to contact me directly if you have questions. I am
>> going to include this post on the mailing list as well in case other
>> people have similar questions.
>>
>> When we originally started (and back when I wrote the tutorial), I
>> thought the best approach would be to have a single massive set of
>> segments, a crawldb, a linkdb, and indexes on the dfs. And if we had this
>> we would need an index splitter so we could split those massive databases
>> to have x number of urls on each search server. The problem with this
>> approach, though, is that it doesn't scale very well (beyond about 50M
>> pages). You have to keep merging whatever you are crawling into your
>> master, and after a while it takes a good deal of time to continually
>> sort, merge, and index.
>>
>> The approach we are using these days is focused on smaller distributed
>> segments and hence indexes. Here is how it works:
>>
>> 1) Inject your database with a beginning url list and fetch those pages.
>> 2) Update a single master crawldb (at this point you only have one).
>> 3) Do a generate with a -topN option to get the best urls to fetch. Do
>> this for the number of urls you want on each search server. A good rule
>> of thumb is no more than 2-3 million pages per disk for searching (this
>> is for web search engines). So let's say your crawldb, once updated from
>> the first run, has more than 2 million urls; you would do a generate
>> with -topN 2000000.
>> 4) Fetch this new segment through the fetch command.
>> 5) Update the single master crawldb with this new segment.
>> 6) Create a single master linkdb (at this point you will only have one)
>> through the invertlinks command.
>> 7) Index that single fetched segment.
>> 8) Use a script, etc. to push the single index, segments, and linkdb to
>> a search server directory from the dfs.
>> 9) Do steps 3-8 for as many search servers as you have. When you reach
>> the number of search servers you have, you can replace the indexes, etc.
>> on the first, second, etc. search servers with new fetch cycles. This
>> way your index always has the best pages for the number of servers and
>> amount of space you have.
>>
>> Once you have a linkdb created, meaning the second or greater fetch,
>> you would create a linkdb for just the single segment and then use the
>> mergelinkdb command to merge that single linkdb into the master linkdb.
>>
>> When pushing the pieces to search servers you can move the entire
>> linkdb, but after a while that is going to get big. A better way is to
>> write a map reduce job that will split the linkdb to only include urls
>> for the single segment that you have fetched. Then you would only move
>> that single linkdb piece out, not the entire master linkdb. If you want
>> to get started quickly, though, just copy the entire linkdb to each
>> search server.
>>
>> This approach assumes that you have a search website fronting multiple
>> search servers (search-servers.txt) and that you can bring down a single
>> search server, update the index and pieces, and then bring the single
>> search server back up. This way the entire index is never down.
>>
>> Hope this helps and let me know if you have any questions.
>>
>> Dennis Kubes
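P.S. On the query-time dedup mentioned in my reply above: nothing Nutch-specific is meant there, just that the front end, when it merges hits coming back from the search servers, keeps only the best-scoring copy of any duplicate url. A toy illustration of the idea; the (url, score) pairs and the example scores stand in for whatever hit objects your front end actually gets back:

def merge_and_dedup(results_per_server, limit=10):
    # keep the highest score seen for each url across all servers
    best = {}
    for hits in results_per_server:
        for url, score in hits:
            if url not in best or score > best[url]:
                best[url] = score
    # rank the surviving urls by score and cut to the page size
    ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:limit]

# www.foo.com was fetched into two different cycles/indexes, but only the
# better-scoring copy is returned to the user
server1 = [("http://www.foo.com/", 1.7), ("http://bar.com/", 1.2)]
server2 = [("http://www.foo.com/", 1.5), ("http://baz.com/", 1.1)]
print(merge_and_dedup([server1, server2]))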
