I am currently writing a Python script to automate this whole process, from inject all the way to pushing out to the search servers. It should be done in a day or two and I will post it on the wiki.
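In rough outline it just chains the standard bin/nutch commands, one fetch/index cycle per search server. Here is a stripped-down sketch of the idea, not the real script: the paths, topN value, server list and the /search/ target are placeholders, there is no error handling, and for brevity it assumes local directories (against the dfs you would shell out to bin/hadoop dfs -copyToLocal before pushing, and rsync/scp from there).

#!/usr/bin/env python
# Sketch of the per-search-server crawl cycle (steps 3-8 in the mail below).

import os
import subprocess

NUTCH    = "bin/nutch"
CRAWLDB  = "crawl/crawldb"
LINKDB   = "crawl/linkdb"
SEGMENTS = "crawl/segments"
TOPN     = "2000000"
SERVERS  = ["search1", "search2"]

def nutch(*args):
    # run one nutch command, stop the whole run if it fails
    subprocess.check_call([NUTCH] + list(args))

def newest_segment():
    # segment directories are timestamped, so the newest one sorts last
    return os.path.join(SEGMENTS, sorted(os.listdir(SEGMENTS))[-1])

for server in SERVERS:
    # step 3: generate the next best TOPN urls from the master crawldb
    nutch("generate", CRAWLDB, SEGMENTS, "-topN", TOPN)
    segment = newest_segment()
    # step 4: fetch the new segment
    nutch("fetch", segment)
    # step 5: update the single master crawldb
    nutch("updatedb", CRAWLDB, segment)
    # step 6 (simplified: later cycles can invert into a temp linkdb and
    # mergelinkdb it into the master, as described below)
    nutch("invertlinks", LINKDB, segment)
    # step 7: index just this fetched segment
    nutch("index", "indexes-" + server, CRAWLDB, LINKDB, segment)
    # step 8: push the index, the segment and (for now) the whole linkdb
    subprocess.check_call(["rsync", "-a", "indexes-" + server, segment,
                           LINKDB, server + ":/search/"])

That is only the core loop; the dfs copying and the per-segment linkdb splitting described below still have to be wired in.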
Dennis Kubes

charlie w wrote:
> Thanks very much for the extended reply; lots of food for thought.
>
> WRT the merge/index time on a large index, I kind of suspected this might be
> the case. It's already taking a bit of time (albeit on a weak box) with my
> relatively small index. In general the approach you outline sounds like
> something I intuitively thought might need to be done, but had no
> real experience to justify that intuition.
>
> So if I understand you correctly, each iteration of fetching winds up on a
> separate search server, and you're not doing any merging of segments?
>
> When you eventually get around to recrawling a particular page, do you wind
> up with problems if that page exists in two separate indexes on two separate
> search servers? For example, we fetch www.foo.com, and that page goes into
> the index on search server 1. Then, 35 days later, we go back to crawl
> www.foo.com, and this time it winds up in the index on search server 2.
> Wouldn't the two search servers return the same page as a hit to a search?
> If not, what prevents that from being an issue?

You can do a dedup of the results on the search itself (see the sketch at the
bottom of this message). So yes, there are duplicates in the different index
segments, but you will always be returning the "best" pages to the user.

> It also seems that I must be missing something regarding new pages. If, as
> in step 9, you are replacing the index on a search server, wouldn't you
> possibly create the effect of removing documents from the index? Say you
> have the same 2 search servers, but do 10 iterations of fetching as a
> "depth" of crawl. Wouldn't you be replacing the documents in search server
> 1 several times over the course of those 10 iterations?

No, because you are updating a single master crawldb, and on the next
iteration it wouldn't grab the same pages; it would grab the next best n pages.

> Once again, thanks.
>
> - Charlie
>
>
> On 7/31/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>> It is not a problem to contact me directly if you have questions. I am
>> going to include this post on the mailing list as well in case other
>> people have similar questions.
>>
>> When we originally started (and back when I wrote the tutorial), I
>> thought the best approach would be to have a single massive set of
>> segments, a crawldb, a linkdb, and indexes on the dfs. And if we had this
>> we would need an index splitter so we could split those massive databases
>> to have x number of urls on each search server. The problem with this
>> approach, though, is that it doesn't scale very well (beyond about 50M
>> pages). You have to keep merging whatever you are crawling into your
>> master, and after a while it takes a good deal of time to continually
>> sort, merge, and index.
>>
>> The approach we are using these days is focused on smaller distributed
>> segments and hence indexes. Here is how it works:
>>
>> 1) Inject your database with a beginning url list and fetch those pages.
>> 2) Update a single master crawldb (at this point you only have one).
>> 3) Do a generate with a -topN option to get the best urls to fetch. Do
>> this for the number of urls you want on each search server. A good rule
>> of thumb is no more than 2-3 million pages per disk for searching (this
>> is for web search engines). So let's say your crawldb, once updated from
>> the first run, has more than 2 million urls; you would do a generate
>> with -topN 2000000.
>> 4) Fetch this new segment through the fetch command.
>> 5) Update the single master crawldb with this new segment.
>> 6) Create a single master linkdb (at this point you will only have one)
>> through the invertlinks command.
>> 7) Index that single fetched segment.
>> 8) Use a script, etc. to push the single index, segments, and linkdb to
>> a search server directory from the dfs.
>> 9) Do steps 3-8 for as many search servers as you have. When you reach
>> the number of search servers you have, you can replace the indexes, etc.
>> on the first, second, etc. search servers with new fetch cycles. This
>> way your index always has the best pages for the number of servers and
>> amount of space you have.
>>
>> Once you have a linkdb created, meaning the second or greater fetch,
>> you would create a linkdb for just the single segment and then use the
>> mergelinkdb command to merge that single linkdb into the master linkdb.
>>
>> When pushing the pieces to search servers you can move the entire
>> linkdb, but after a while that is going to get big. A better way is to
>> write a map reduce job that will split the linkdb to only include urls
>> for the single segment that you have fetched. Then you would only move
>> that single linkdb piece out, not the entire master linkdb. If you want
>> to get started quickly, though, just copy the entire linkdb to each
>> search server.
>>
>> This approach assumes that you have a search website fronting multiple
>> search servers (search-servers.txt) and that you can bring down a single
>> search server, update the index and pieces, and then bring the single
>> search server back up. This way the entire index is never down.
>>
>> Hope this helps and let me know if you have any questions.
>>
>> Dennis Kubes
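P.S. On the query-time dedup mentioned in my reply above: nothing Nutch-specific is meant there, just that the front end, when it merges hits coming back from the search servers, keeps only the best-scoring copy of any duplicate url. A toy illustration of the idea; the (url, score) pairs and the example scores stand in for whatever hit objects your front end actually gets back:

def merge_and_dedup(results_per_server, limit=10):
    # keep the highest score seen for each url across all servers
    best = {}
    for hits in results_per_server:
        for url, score in hits:
            if url not in best or score > best[url]:
                best[url] = score
    # rank the surviving urls by score and cut to the page size
    ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:limit]

# www.foo.com was fetched into two different cycles/indexes, but only the
# better-scoring copy is returned to the user
server1 = [("http://www.foo.com/", 1.7), ("http://bar.com/", 1.2)]
server2 = [("http://www.foo.com/", 1.5), ("http://baz.com/", 1.1)]
print(merge_and_dedup([server1, server2]))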
