On 6/22/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>
> Karol Rybak wrote:
> >>
> >> Karol Rybak wrote:
> >> > Hello, I have some questions about Nutch in general. I need to create a
> >> > simple web crawler, however we want to index a lot of documents; it'll
> >> > probably be about 100 million in the future. I have a couple of servers I can
> >>
> >> 100 million pages = 50-100 servers and 20-40T of space distributed.
> >> Ideally the setup would be processing machines and search servers. You
> >> would have, say, 50 or so processing machines that would handle the
> >> crawling, indexing, mapreduce, and dfs. Then you would have 50 more
> >> somewhat less powerful (possibly) servers that handle just serving the
> >> search. You can get away with having the processing and search servers
> >> on the same machines, but the search will slow down considerably while
> >> running large jobs.
> >
> > Hello, thanks for your answer. 20-40T of space seems large; the question is,
> > do you store fetched files, or just indexes? I don't want to maintain local
> > storage, I need only indexing...
> >
>
> You need space to store the fetched documents (segments). Even when
> compressed, 100M documents take a lot of space. You are also going to have
> a crawldb, linkdb, and indexes, which effectively doubles the amount of
> space you need. This will have to be on a DFS because there is no
> single machine that can handle this load, and because RAID at this level
> is prohibitively expensive. On the DFS you are going to replicate your
> data blocks a minimum of 3 times for redundancy, so you just tripled your
> space.
>
> You will still need space on the machines for processing the next jobs,
> unless you plan to delete all of the databases and start from scratch
> every time, which isn't advised. So for sorts and other MapReduce job
> processing you will want to leave approximately 30% of the space open on
> each box. Depending on the jobs you are running you may need more.
>
> If you are using the same boxes for search servers you will then have to
> copy the indexes from the DFS to local disk, which again doubles the space
> needed. The estimate that we use is 100-200G for every 1M pages
> indexed. You probably can get away with 50G per 1M pages, but we have
> large computational jobs running and we don't want to run out of space.
>
> A rough calculation would be ~4G of compressed content per 1M pages fetched
> initially, or 4K compressed per fetched page. So 4G * 2 for crawldb, linkdb,
> and indexes = 8G, * 3 for DFS replication = 24G, * 1.3 for processing space
> = ~31G, + 4G for local indexes = ~35G per 1M pages.
>
> You said above that you don't want local storage. Search has to be on
> local file systems. While you may technically be able to pull a search
> result from the DFS, you will almost certainly run out of memory, and the
> search will take an excessively long time (minutes, not subsecond) if it
> returns at all. Search is a hardware-intensive business, in part because of
> the number of servers that are needed to handle serving large indexes.
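A minimal sketch of the per-million-page arithmetic quoted above, in Python. The ~4G of compressed content per 1M pages, the doubling for crawldb/linkdb/indexes, the 3x DFS replication, the ~30% processing headroom, and the ~4G of local index are the figures from the quote; the function name and the script itself are only illustrative, not a Nutch tool.

# Back-of-envelope Nutch storage sizing using the figures quoted above.
def storage_gb_per_million_pages(
    segments_gb: float = 4.0,          # compressed fetched content (segments)
    db_and_index_factor: float = 2.0,  # crawldb + linkdb + indexes ~double it
    dfs_replication: int = 3,          # DFS block replication for redundancy
    processing_headroom: float = 1.3,  # keep ~30% free for sorts / MapReduce
    local_index_gb: float = 4.0,       # indexes copied from DFS to local disk
) -> float:
    on_dfs = segments_gb * db_and_index_factor * dfs_replication
    return on_dfs * processing_headroom + local_index_gb

if __name__ == "__main__":
    per_million = storage_gb_per_million_pages()
    print(f"~{per_million:.0f} GB per 1M pages (bare minimum)")  # ~35 GB
    print(f"~{per_million * 100 / 1024:.1f} TB for 100M pages")  # ~3.4 TB
    # Dennis's padded operational estimate of 100-200 GB per 1M pages
    # works out to roughly 10-20 TB for 100M pages.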
Actually, as long as the indexes are on local machines, fetching summaries from
DFS is not that slow (probably less than 5 seconds). Obviously, storing them
locally as well improves performance (to subsecond levels).

> If anybody knows of a better way to set up a search architecture than
> 2-4M pages per index per search server I would love to hear about it.
> The former suggestions of space and architecture are what we have
> experienced.
>
> Dennis Kubes

--
Doğacan Güney
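For reference, the 2-4M pages per index per search server guideline discussed above translates directly into a search-tier head count for a 100M-page crawl; a trivial, purely illustrative sketch (the function name is made up):

import math

def search_servers_needed(total_pages: int, pages_per_index: int) -> int:
    # One search server per index partition of pages_per_index pages.
    return math.ceil(total_pages / pages_per_index)

if __name__ == "__main__":
    for pages_per_index in (2_000_000, 4_000_000):
        n = search_servers_needed(100_000_000, pages_per_index)
        print(f"{pages_per_index // 1_000_000}M pages/index -> {n} search servers")
    # Prints 50 and 25; the 2M-pages-per-index case matches the ~50
    # search servers suggested earlier in the thread.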
