Karol Rybak wrote:
> Hello, I have some questions about Nutch in general. I need to create a
> simple web crawler; however, we want to index a lot of documents, probably
> about 100 million in the future. I have a couple of servers I can
100 million pages = 50-100 servers and 20-40T of space, distributed.
Ideally the setup would be split into processing machines and search
servers. You would have, say, 50 or so processing machines that handle
the crawling, indexing, MapReduce, and DFS. Then you would have 50 more,
possibly less powerful, servers that handle just serving search results.
You can get away with running the processing and search servers on the
same machines, but search will slow down considerably while large jobs
are running.
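
For what it's worth, that split is what Nutch's distributed search was
built for: each search box serves its local index via the
DistributedSearch server, and the web front end finds the boxes through
a search-servers.txt file pointed at by searcher.dir. A rough sketch of
the wiring -- the host names, port, and paths below are made up for
illustration, so adjust to your layout:

    # On each search machine, serve its local crawl/index directory:
    bin/nutch server 9999 /data/crawl

    # On the front-end webapp, point searcher.dir at a directory
    # in conf/nutch-site.xml:
    <property>
      <name>searcher.dir</name>
      <value>/data/search</value>
    </property>

    # ...and put a search-servers.txt in that directory, with one
    # "host port" pair per line (hypothetical hosts):
    search01.example.com 9999
    search02.example.com 9999

The point of the split is that the heavy crawl/index jobs on the
processing cluster never compete with query traffic; you push completed
indexes out to the search boxes when a crawl cycle finishes.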
Hello, thanks for your answer. 20-40T of space seems large; the question is,
do you store the fetched files, or just the indexes? I don't want to maintain
local storage, I only need indexing...