Karol Rybak wrote:
>>
>> You need space to store the fetched documents (segments).  Even when
>> compressed, 100M documents take a lot of space.
> 
> 
> That's what my question was really about: why do I need to keep those
> fetched documents?  I was thinking that I could remove them right after
> they were indexed.

Segments are used to create the summaries you see below the search 
results.  The linkdb is used to find pages that link to a given page.  If 
you don't want the summaries, you can remove the segments.
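
For example, once a segment has been indexed and folded into the crawldb 
and linkdb, it could be dropped from the DFS with something along these 
lines (the segment path here is only an illustration, not your actual 
layout):

  bin/hadoop dfs -rmr crawl/segments/20070806123456

Just keep in mind that you lose the cached content and the summaries for 
the pages in that segment.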
> 
>> You are going to have crawldb, linkdb, and indexes which effectively
>> doubles the amount of space you need.  This will have to be on a DFS
>> because there is no single machine that can handle this load and because
>> RAID at this level is prohibitively expensive.  On the DFS you are going
>> to replicate your data blocks at a minimum 3 times for redundancy, so you
>> just tripled your space.
> 
> 
> There's a second question coming to my mind: is 3 times the minimum
> setting, or is that just what is considered safe?  How about using DFS
> with no redundancy at all?  Is that a possible setup?  I understand that
> I could lose data that way, but is it possible?
> 
I believe it is possible to have zero replication although we have never 
tested that type of setup.  The downside here is that if one thing goes 
wrong you could lose all of your work.  I can tell you that the DFS has 
saved us multiple times.
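
If you do decide to try it, replication is just a configuration property; 
something like the following in hadoop-site.xml should do it (again, we 
have not run at a factor of 1 ourselves):

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

A value of 2 would still survive a single failed disk or machine at 
roughly two thirds of the space cost of 3.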
> 
>> You will still need space on the machines for processing the next jobs,
>> unless you plan to delete all of the databases and start from scratch
>> every time, which isn't advised.  So for sorts and other map reduce job
>> processing you will want to leave approximately 30% of the space open on
>> each box.  Depending on the jobs you are running you may need more.
>>
>> If you are using the same boxes for search servers you will then have to
>> copy the indexes from the DFS to local disk, which again doubles the space
>> needed.  The estimate that we use is 100-200G for every 1M pages
>> indexed.  You probably can get away with 50G per 1M pages but we have
>> large computational jobs that are running and we don't want to run out
>> of space.
> 
> 
> From what you said, I understand that I would get better performance if I
> split the index across many computers manually (not using DFS), and that
> what I get from DFS is better failure resistance for my system because of
> data redundancy?

Right, searching a local file system avoids the network overhead and 
therefore has much better performance.  The downside is that after the 
indexes are created on the DFS they need to be copied to the local file 
systems, along with the segments and linkdb if you are using those, as 
stated above.
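
The copy itself is just a DFS-to-local transfer, for example (paths are 
illustrative only):

  bin/hadoop dfs -copyToLocal crawl/indexes /local/crawl/indexes
  bin/hadoop dfs -copyToLocal crawl/linkdb /local/crawl/linkdb
  bin/hadoop dfs -copyToLocal crawl/segments /local/crawl/segments

and then the search server is pointed at the local copy via the 
searcher.dir property.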
> 
> 
>> A rough calculation would be ~4G compressed content per 1M pages fetched
>> initially, or 4K compressed per fetched page.  So 4G * 2 for crawl, link,
>> indexing = 8G, * 3 for DFS replication = 24G, * 1.3 for processing space
>> = 31.2G, + 4G for local indexes = 35.2G.
> 
> 
> Well, that's a nice calculation; however, I could imagine reducing those
> requirements for my setup.
> 1. I don't really need that data to be redundant; I can afford losing part
> of the index, so I could ditch DFS replication.

100M pages, depending on bandwidth, can take quite a while to fetch and 
index.  Make sure that you really are willing to lose that time if 
something goes wrong.  If you are, then yes, it doesn't have to be redundant.
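
As a back-of-the-envelope illustration only (the fetch rate is purely an 
assumption; yours depends on bandwidth and politeness settings): at a 
sustained 100 pages/sec,

  100,000,000 pages / 100 pages/sec = 1,000,000 sec, roughly 11-12 days

of continuous fetching before any indexing, which is the work you would 
be betting on an unreplicated DFS.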

> 2. I don't want to store segments after indexing.

As above, that depends on whether you are using summaries, etc.

> 3. I would only use local indexes.
> So the calculation for 1M pages would look like:
> 4G crawl, link, index
> 4G * 1.3 processing space = 5.2G

As an example, one of our splits contains segments, linkdb, and indexes 
on local disks:

 3039536082  indexes    2.8G
   94417902  linkdb      90M
14613198286  segments  13.6G
----------------------------
17747152270  total     16.5G

This split contains about 1.64M pages currently.  So my earlier 
calculations may be somewhat low on the segment space.
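
Working backwards from that split, per 1M pages that is roughly:

  segments: 13.6G / 1.64M ~ 8.3G per 1M pages
  indexes:   2.8G / 1.64M ~ 1.7G per 1M pages
  linkdb:     90M / 1.64M ~  55M per 1M pages

or about 10G per 1M pages for segments + index + linkdb on local disk, 
before any DFS replication or processing headroom.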

Dennis Kubes

> 
> Another question would be what part of those 4G is taken by the index; I
> think it's the majority, but I might be very wrong...
> 
>> You said above that you don't want local storage.  Search has to be on
>> local file systems.  While you may technically be able to pull a search
>> result from the DFS you will almost certainly run out of memory and the
>> search will take an excessively long time (minutes, not subsecond) if it
>> returns.  Search is a hardware intensive business in part because of the
>> number of servers that are needed to handle serving large indexes.
>>
>> If anybody knows of a better way to set up a search architecture than
>> 2-4M pages per index per search server, I would love to hear about it.
>> The former suggestions of space and architecture are what we have
>> experienced.
>>
>> Dennis Kubes
> 
> 
> 
> Thanks for your patience and for answering my questions, but we need to
> know as much as possible about the software before the actual
> implementation...
> 
