On Tue, May 18, 2004 at 02:24:54PM -0700, Doug Cutting wrote:
> [EMAIL PROTECTED] wrote:
> >Yes, that is the way I do my fetch/search cycles:
> >first round fetch text/html only, basically collect as many links as
> >possible
> >second round, application/msword,
> >third round, application/pdf,
> >...
> >all can go in parallel, and provide better storage management,
> >for pdf, doc are typically much larger than html and
> >you do not want to mix them with html in the same segment.
>
> Why don't you want to mix them?
>
> Doug

This is an operational issue for me. An html-only segment is smaller (or, at the same size, has more entries). It can easily be carried around to different hosts to start up runs for other formats. Furthermore, in a way, html pages are more valuable than pdf and doc files: given that I have a limited amount of reliable storage, I can afford to lose pdf and doc, but not html.

John

_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
