Re: [Nutch-dev] code for index content of mime type beyond text/html

john Tue, 18 May 2004 15:12:41 -0700

On Tue, May 18, 2004 at 02:24:54PM -0700, Doug Cutting wrote:
> [EMAIL PROTECTED] wrote:
> >Yes, that is the way I do my fetch/search cycles:
> >first round fetch text/html only, basically collect as many links as 
> >possible
> >second round, application/msword,
> >third round, application/pdf,
> >...
> >all can go in parallel, and provide better storage management,
> >for pdf, doc are typically much larger than html and
> >you do not want to mix them with html in the same segment.
> 
> Why don't you want to mix them?
> 
> Doug


This is an operational issue for me.
HTML-only segments are smaller (or, at the same size, hold more entries).
They can easily be carried around to different hosts and used to start
up runs for other formats. Furthermore, in a way, HTML pages are
more valuable than PDF or DOC files. I can afford to lose PDF and DOC
files, but not HTML, given that I have a limited amount of reliable
storage.
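
To illustrate the per-type rounds, here is a minimal sketch. It is not
Nutch code: the function name plan_rounds, the ROUND_ORDER list, and the
sample URLs are made up for illustration, and it assumes the MIME type of
each URL is already known (or guessed) before the round is planned.

```python
# Hypothetical round order: HTML first, then the larger formats.
ROUND_ORDER = ["text/html", "application/msword", "application/pdf"]


def plan_rounds(urls_with_types, order=ROUND_ORDER):
    """Group (url, mime_type) pairs into one fetch list per MIME type.

    Returns a list of (mime_type, urls) tuples in round order, so each
    round can be written to its own segment. Types that are not in
    `order`, or that have no URLs, are left out.
    """
    by_type = {}
    for url, mime_type in urls_with_types:
        by_type.setdefault(mime_type, []).append(url)
    return [(t, by_type[t]) for t in order if t in by_type]


if __name__ == "__main__":
    sample = [
        ("http://example.com/a.html", "text/html"),
        ("http://example.com/b.pdf", "application/pdf"),
        ("http://example.com/c.html", "text/html"),
    ]
    for mime_type, urls in plan_rounds(sample):
        print(mime_type, urls)
```

Each (type, urls) pair would then become its own fetch list and segment,
so the HTML segment stays small and can be kept on the reliable storage.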

John




_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
