On 6/22/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> Karol Rybak wrote:
>>> You need space to store the fetched documents (segments). Even when
>>> compressed, 100M documents take a lot of space.
>>
>> That's what my question was really about: why do I need to keep those
>> fetched documents? I was thinking that I could remove them right after
>> they were indexed.
>
> Segments are used to create the summaries you see below the search
> results. The linkdb is used to find pages that link to a given page. If
> you don't want the summaries, you can remove the segments.
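
Right, and since the segments are just directories under the crawl path,
cleaning one up after its documents are indexed should amount to a recursive
delete. A minimal sketch against Hadoop's FileSystem API, which Nutch runs
on (the timestamped segment name below is made up, and the two-argument
recursive delete is the signature in recent Hadoop releases; older ones had
a one-argument form):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SegmentCleanup {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // recursively delete one segment directory once it has been indexed
        fs.delete(new Path("crawl/segments/20070622110437"), true);
        fs.close();
      }
    }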
Well, that's great. I will not actually use the Nutch search system because
we do not need it; we're going to query the indexes with Lucene directly.
All the queries will be software-generated.
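
For reference, the sort of generated query we have in mind is plain Lucene
over the index directories Nutch produces. A minimal sketch using the
classic (pre-3.0) Lucene API that ships with Nutch; the index path and the
query term are placeholders, while "content" and "url" are standard fields
in a Nutch-built index:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class GeneratedQuery {
      public static void main(String[] args) throws Exception {
        // open one part of the index produced by the Nutch indexer
        IndexSearcher searcher = new IndexSearcher("crawl/indexes/part-00000");
        // a software-generated query: documents whose content contains "nutch"
        Hits hits = searcher.search(new TermQuery(new Term("content", "nutch")));
        for (int i = 0; i < hits.length(); i++) {
          System.out.println(hits.doc(i).get("url"));
        }
        searcher.close();
      }
    }

The per-part indexes can be searched together with Lucene's MultiSearcher,
or merged up front with Nutch's index merger.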
>  3039536082   indexes     2.8G
>    94417902   linkdb       90M
> 14613198286   segments    13.6G
> -------------------------------
> 17747152270   total       16.5G
Thanks for providing that info, it saved me a lot of time. From what I can
see, the majority of the disk space is used by the segments; the linkdb is
not a problem and the indexes are not so huge :)
From my calculations it seems that the indexes for 100 million documents
would take up about 170 GB of disk space, and that's really small, so I will
be able to store redundant data anyway. Thanks for all your help. It seems
that you've just gotten another Nutch user :)
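(Sanity-checking that projection, assuming index size grows roughly linearly
with document count: 170 GB for 100 million documents is about 1.7 KB of
index per page, which is consistent with the 2.8 GB sample above if that
crawl held somewhere around 1.6 million pages.)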