Karol Rybak wrote:
> Hello, I have some questions about Nutch in general. I need to create a
> simple web crawler; however, we want to index a lot of documents, probably
> about 100 million in the future. I have a couple of servers I can

100 million pages = 50-100 servers and 20-40 TB of space, distributed. 
Ideally the setup would be split into processing machines and search 
servers.  You would have, say, 50 or so processing machines that handle 
the crawling, indexing, MapReduce, and DFS.  Then you would have another 
50 somewhat less powerful (possibly) servers that handle just serving 
the search.  You can get away with having the processing and search 
roles on the same machines, but search will slow down considerably while 
large jobs are running.
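
To make that concrete, here is a minimal sketch of one possible layout 
(the hostnames, counts, and placement of the Hadoop daemons are 
placeholder assumptions, not a description of our cluster):

   master              - Hadoop NameNode + JobTracker
   proc01..proc50      - Hadoop DataNode + TaskTracker (fetch, parse, index jobs)
   search01..search50  - Nutch distributed search server, one index split each
                         on local disk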

> use. I wanted to distribute the index between those computers. Ideally I
> want one computer to crawl the web and fetch pages. Then dedicated indexing
> machines would take a subset of the fetched pages and index them, then remove
> the fetched files to save disk space. Search would be performed on those
> machines and the results would be combined. I realise that it will take a
> lot of customisation. I was thinking about NDFS, but that will only utilize
> the disk space of my servers, not the processors, and I need search results
> as fast as possible. So here are a couple of questions:


> 1. How difficult would it be to make the Nutch fetcher split fetched data
> into sections I can then index on different computers? Is it worth it, or 
> maybe

The way we approach it is to use a custom split indexer that indexes 
into multiple index "splits" at once.  Those splits are then moved from 
the DFS to the individual search servers through scripts.  We are 
working on a more customizable index splitter and will release it when 
finished.
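
As a rough illustration of what those scripts do (the paths, hostname, 
and split name here are hypothetical, and the exact commands depend on 
your Hadoop/Nutch version):

   # pull one index split out of DFS onto local disk
   bin/hadoop dfs -copyToLocal /user/nutch/indexes/part-00003 /tmp/part-00003
   # push it to the search server that will serve this split
   rsync -av /tmp/part-00003/ search03:/data/index/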

> I would be better off writing a crawler from scratch?

You don't need to write the crawler from scratch; Nutch already has one 
and it works very well.  You can extend it for custom parsing, etc.

> 2. How processor-intensive is the searching? Maybe NDFS would be good
> enough?

Search must be done on local drives, not DFS, for performance.  You 
would use DFS to create the indexes through MapReduce jobs, but those 
indexes then need to be moved to the local file system for searching.

A generic setup would be the search website with a search-servers.txt 
file pointing to all of your search servers.  Each search server points 
to a slice or piece (its split) of the entire index on the local drive.
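
For example, search-servers.txt is just one host/port pair per line; 
the hostnames and port below are made up:

   search01 9999
   search02 9999
   search03 9999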

> 3. This should probably go to the Lucene mailing list, but how well does
> Lucene handle huge indexes? I have 4 x 150 GB drives, and they will get
> filled up with indexed data someday (the full text of documents will be
> indexed). Will Lucene handle that?

Lucene is great.  You will hit I/O performance issues before you hit 
Lucene performance issues (depending on the type of searches you are 
doing).  I believe our current index is around 1 terabyte unsplit, not 
including segments, etc.

> 4. The systems will be searching a lot, so I need fast response times; 1-2
> seconds is acceptable, but not more (is that possible with that kind of
> setup?). Search queries will be simple, no fancy similarity or complicated
> Boolean searches.

This is all dependent on the size of each local index.  Approximately 
2-4M pages per index split is good.  Over that you may see performance 
decreases.  Scaling that out over many servers scales close to 
linearly.  We have almost 100M pages in the index and are seeing 
sub-second response times on most queries.
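
To put rough numbers on that: at around 3M pages per split, 100M pages 
works out to roughly 30-35 splits, i.e. 30-35 search servers each 
serving one split, with each query fanned out to all of them and the 
results merged.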

> 5. Any thoughts or suggestions are welcome.

Remember that a search engine at the scale you are talking about is a 
hardware-intensive operation.  Use ECC memory on all of your machines. 
Make sure you have sufficient bandwidth for fetching, around 30-50 Mbps.

The general setup: MapReduce and DFS jobs create the indexes.  The 
index splits are then moved to local disks on multiple search servers, 
with each split holding 2-4M pages.  The search servers are started, 
and the website is pointed at them through the search-servers.txt file. 
More search servers can be added later for more pages.
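
On each search server, the per-split search process is started with 
something like the following; the port and directory are placeholders, 
and the directory is whatever local path holds that server's slice of 
the index:

   bin/nutch server 9999 /data/index

The website's search-servers.txt then lists each of those host/port 
pairs, as in the example earlier.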

Good configs for processing machines are 1U, Intel Core 2 Duo, 4 GB ECC 
RAM, 750 GB hard drive.  Each machine should draw < 1 amp of power, so a 
full rack will need approximately 60 amps at the 80% target rates most 
data centers use.  Machines that are only search servers can use the 
above config with 100 GB hard drives (less if you can find them).

Network is ideally Gigabit Ethernet.

Dennis Kubes
> 
> 
> Karol Rybak
> Programmer
> University of Internet Technology and Management
> 
