On 6/22/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>
> Karol Rybak wrote:
> >>
> >> Karol Rybak wrote:
> >> > Hello, I have some questions about Nutch in general. I need to create a
> >> > simple web crawler, however we want to index a lot of documents; it'll
> >> > probably be about 100 million in the future. I have a couple of servers I can
> >>
> >> 100 million pages = 50-100 servers and 20-40T of space distributed.
> >> Ideally the setup would be processing machines and search servers. You
> >> would have, say, 50 or so processing machines that would handle the
> >> crawling, indexing, mapreduce, and dfs. Then you would have 50 more
> >> somewhat less powerful (possibly) servers that handle just serving the
> >> search. You can get away with having the processing and search servers
> >> on the same machines, but the search will slow down considerably while
> >> running large jobs.
> >
> > Hello, thanks for your answer. 20-40T of space seems large; the question is,
> > do you store fetched files, or just indexes? I don't want to maintain local
> > storage, I need only indexing...
> >
>
> You need space to store the fetched documents (segments). Even when
> compressed, 100M documents take a lot of space. You are also going to have
> a crawldb, linkdb, and indexes, which effectively doubles the amount of
> space you need. This will have to be on a DFS because there is no
> single machine that can handle this load, and because RAID at this level
> is prohibitively expensive. On the DFS you are going to replicate your
> data blocks a minimum of 3 times for redundancy, so you just tripled your
> space.
>
> You will still need space on the machines for processing the next jobs,
> unless you plan to delete all of the databases and start from scratch
> every time, which isn't advised. So for sorts and other MapReduce job
> processing you will want to leave approximately 30% of the space open on
> each box. Depending on the jobs you are running you may need more.
>
> If you are using the same boxes for search servers you will then have to
> copy the indexes from the DFS to local disk, which again doubles the space
> needed. The estimate that we use is 100-200G for every 1M pages
> indexed. You probably can get away with 50G per 1M pages, but we have
> large computational jobs running and we don't want to run out of space.
>
> A rough calculation would be ~4G of compressed content per 1M pages fetched
> initially, or 4K compressed per fetched page. So 4G * 2 for crawldb, linkdb,
> and indexes = 8G, * 3 for DFS replication = 24G, * 1.3 for processing space
> = ~31G, + 4G for local indexes = ~35G per 1M pages.
>
> You said above that you don't want local storage. Search has to be on
> local file systems. While you may technically be able to pull a search
> result from the DFS, you will almost certainly run out of memory, and the
> search will take an excessively long time (minutes, not subsecond) if it
> returns at all. Search is a hardware-intensive business, in part because of
> the number of servers that are needed to handle serving large indexes.
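A minimal sketch of the per-million-page arithmetic quoted above, in Python. The ~4G of compressed content per 1M pages, the doubling for crawldb/linkdb/indexes, the 3x DFS replication, the ~30% processing headroom, and the ~4G of local index are the figures from the quote; the function name and the script itself are only illustrative, not a Nutch tool.

# Back-of-envelope Nutch storage sizing using the figures quoted above.
def storage_gb_per_million_pages(
    segments_gb: float = 4.0,          # compressed fetched content (segments)
    db_and_index_factor: float = 2.0,  # crawldb + linkdb + indexes ~double it
    dfs_replication: int = 3,          # DFS block replication for redundancy
    processing_headroom: float = 1.3,  # keep ~30% free for sorts / MapReduce
    local_index_gb: float = 4.0,       # indexes copied from DFS to local disk
) -> float:
    on_dfs = segments_gb * db_and_index_factor * dfs_replication
    return on_dfs * processing_headroom + local_index_gb

if __name__ == "__main__":
    per_million = storage_gb_per_million_pages()
    print(f"~{per_million:.0f} GB per 1M pages (bare minimum)")  # ~35 GB
    print(f"~{per_million * 100 / 1024:.1f} TB for 100M pages")  # ~3.4 TB
    # Dennis's padded operational estimate of 100-200 GB per 1M pages
    # works out to roughly 10-20 TB for 100M pages.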
Actually, as long as the indexes are on local machines, fetching summaries from
DFS is not that slow (probably less than 5 seconds). Obviously, storing them
locally as well improves performance (to subsecond levels).

> If anybody knows of a better way to set up a search architecture than
> 2-4M pages per index per search server I would love to hear about it.
> The former suggestions of space and architecture are what we have
> experienced.
>
> Dennis Kubes

--
Doğacan Güney
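For reference, the 2-4M pages per index per search server guideline discussed above translates directly into a search-tier head count for a 100M-page crawl; a trivial, purely illustrative sketch (the function name is made up):

import math

def search_servers_needed(total_pages: int, pages_per_index: int) -> int:
    # One search server per index partition of pages_per_index pages.
    return math.ceil(total_pages / pages_per_index)

if __name__ == "__main__":
    for pages_per_index in (2_000_000, 4_000_000):
        n = search_servers_needed(100_000_000, pages_per_index)
        print(f"{pages_per_index // 1_000_000}M pages/index -> {n} search servers")
    # Prints 50 and 25; the 2M-pages-per-index case matches the ~50
    # search servers suggested earlier in the thread.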
