You need space to store the fetched documents (segments). Even when
compressed, 100M documents take a lot of space.
That's what my question was really about: why do I need to keep those
fetched documents? I was thinking that I could remove them right after they
were indexed.
You are going to have a crawldb, a linkdb, and indexes, which effectively
doubles the amount of space you need. This will have to be on a DFS because
there is no single machine that can handle this load and because RAID at
this level is prohibitively expensive. On the DFS you are going to replicate
your data blocks a minimum of 3 times for redundancy, so you have just
tripled your space.
There's a second question coming to my mind: is 3 the minimum setting, or is
it just what is considered safe? How about using a DFS with no redundancy at
all? I understand that I could lose data that way, but is that a possible
setup?
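As a point of reference for the question above: the replication factor is an
ordinary Hadoop setting rather than a fixed constant, normally configured as
the dfs.replication property. Below is a minimal sketch of reading and
overriding it through the Configuration API; the property name is standard,
while dropping it to 1 (a single copy of every block, so losing any datanode
loses data) is only an assumption about what a no-redundancy setup would
look like.

import org.apache.hadoop.conf.Configuration;

// Sketch: inspect and override the DFS block replication factor.
// dfs.replication is the standard Hadoop property; a value of 1 keeps a
// single copy of each block, which is the no-redundancy trade-off asked about.
public class ReplicationSetting {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Whatever the cluster's config files set, defaulting to 3 if unset.
        System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));
        // Hypothetical no-redundancy setup: keep only one copy of every block.
        conf.setInt("dfs.replication", 1);
    }
}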
You will still need space on the machines for processing the next jobs,
unless you plan to delete all of the databases and start from scratch every
time, which isn't advised. So for sorts and other map-reduce job processing
you will want to leave approximately 30% of the space open on each box.
Depending on the jobs you are running, you may need more.
If you are using the same boxes as search servers you will then have to
copy the indexes from the DFS to local disk, which again doubles the space
needed. The estimate that we use is 100-200G for every 1M pages indexed.
You can probably get away with 50G per 1M pages, but we have large
computational jobs running and we don't want to run out of space.
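To make the "copy the indexes from the DFS to local" step concrete, here is
a rough sketch using the standard Hadoop FileSystem API; the two paths are
hypothetical, and the copy shown is exactly the second, local copy of the
index referred to above.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: pull a finished index off the DFS onto the local disk of a search
// server. The paths are hypothetical; the point is that serving search needs
// a second, local copy of the index on top of the one kept in the DFS.
public class CopyIndexLocal {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem dfs = FileSystem.get(conf);              // the configured (distributed) file system
        Path dfsIndex = new Path("/crawl/indexes");         // hypothetical index location on the DFS
        Path localIndex = new Path("/data/search/indexes"); // hypothetical local directory for the searcher
        dfs.copyToLocalFile(dfsIndex, localIndex);          // the copy that doubles the index space
    }
}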
From what you said, I understand that I would get better performance if I
split the index across many computers manually (not using DFS), and that
what I get from DFS is better failure resistance for my system because of
data redundancy?
A rough calculation would be ~4G of compressed content per 1M pages fetched
initially, or 4K compressed per fetched page. So 4G of segments, plus
roughly 2x that (4G * 2 = 8G) for the crawl, link, and index data, gives
12G; * 3 for DFS replication = 36G; * 1.3 for processing space = 46.8G;
+ 4G for local indexes = 50.8G.
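Written out as a throwaway sketch, with every constant taken from the
figures above (the class and variable names are just for illustration):

// Back-of-the-envelope disk estimate per 1M fetched pages, using the figures above.
public class SpaceEstimate {
    public static void main(String[] args) {
        double segments = 4.0;                     // ~4K compressed per page * 1M pages = 4G
        double databases = segments * 2;           // crawl, link and index data = 8G
        double onDfs = (segments + databases) * 3; // 12G * replication factor of 3 = 36G
        double withHeadroom = onDfs * 1.3;         // ~30% extra for sorts and other jobs = 46.8G
        double total = withHeadroom + 4.0;         // plus the local, non-DFS copy of the indexes
        System.out.printf("on DFS with headroom: %.1fG, total: %.1fG%n", withHeadroom, total); // 46.8G, 50.8G
    }
}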
Well, that's a nice calculation; however, I can imagine reducing those
requirements for my setup:
1. I don't really need the data to be redundant. I can afford losing part
of the index, so I could ditch DFS replication.
2. I don't want to store segments after indexing.
3. I would only use local indexes.
So the calculation for 1M pages would look like:
4G for crawl, link, and index data
4G * 1.3 for processing space = 5.2G
The other question would be: what part of those 4G is taken by the index?
I think it's the majority, but I might be very wrong...
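For what it's worth, scaling that reduced figure up to the 100M-document
crawl mentioned at the top of the thread is a one-liner; the 100M target and
the 5.2G per 1M pages are the only inputs, both taken from this thread:

// Scale the reduced per-1M-pages estimate (no replication, no stored segments,
// local indexes only) up to the 100M-document crawl discussed earlier.
public class ReducedEstimate {
    public static void main(String[] args) {
        double perMillion = 4.0 * 1.3; // crawl/link/index data plus 30% processing headroom = 5.2G
        double millions = 100;         // the 100M-document target
        System.out.printf("%.1fG per 1M pages -> %.0fG for %.0fM pages%n",
                perMillion, perMillion * millions, millions); // 5.2G -> 520G
    }
}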
You said above that you don't want local storage. Search has to be on
local file systems. While you may technically be able to pull a search
result from the DFS, you will almost certainly run out of memory, and the
search will take an excessively long time (minutes, not sub-second) if it
returns at all. Search is a hardware-intensive business, in part because of
the number of servers needed to serve large indexes.
If anybody knows of a better way to set up a search architecture than 2-4M
pages per index per search server, I would love to hear about it. The space
and architecture suggestions above are what we have experienced.
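Taking the 2-4M pages per index per search server figure together with the
100M-document target from earlier in the thread, the implied size of the
search fleet works out roughly as below; nothing in the sketch goes beyond
those two numbers.

// Rough search-server count implied by 2-4M pages per index per server
// for the 100M-page crawl discussed earlier in the thread.
public class SearchFleetEstimate {
    public static void main(String[] args) {
        long totalPages = 100000000L;
        for (long perServer : new long[] {2000000L, 4000000L}) {
            long servers = (totalPages + perServer - 1) / perServer; // round up
            System.out.println(perServer / 1000000 + "M pages/server -> " + servers + " search servers");
        }
    }
}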
Dennis Kubes
Thanks for your patience and for answering my questions, but we need to know
as much as possible about the software before the actual implementation...