On Sun, Oct 31, 2004 at 03:43:22PM -0500, Luke Baker wrote:
> On 10/31/2004 12:22 PM, John X wrote:
> [snip]
> >What are the numbers for kb/s and bytes/page?
> >I have a collection of mostly mswords, ppts and some pdfs, the numbers are
> >041001 194517 10 status: 0.17712256 pages/s, 8246.524 kb/s, 5959461.5
> >bytes/page
> >Some files are very large in size: 100 - 300 Mbytes.
> >Not sure if other pdf libs will be better that PDFBox.
> >You can always separate crawl of html from ones of word, ppt, pdf, etc.
>
> Here's what I got when I just generated a fetchlist of only pdf files.
> I ran this on a dual 1Ghz w/ 1GB of RAM.
> segment 20041031150402, 400 pages, 68 errors, 69544345 bytes, 1308956 ms
> status: 0.30558702 pages/s, 415.0752 kb/s, 173860.86 bytes/page
>
> The whole time both CPUs were pegged. Depending on how many pdfs are
> spread out through your fetchlist, you can see how this can easily slow
> down everything else. It looks like I'll have to do without pdf support
> or have those in separate crawls.

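For reference, the status line quoted above can be reproduced from the
segment totals. This is only a rough sketch: the function name
fetch_status is made up for illustration, and the kb/s formula
(bytes * 8 / 1024 / seconds) is an assumption inferred from the quoted
numbers, not taken from the Nutch source.

```python
def fetch_status(pages, total_bytes, elapsed_ms):
    """Return (pages/s, kb/s, bytes/page) for one fetched segment.

    Hypothetical helper for illustration; not part of Nutch.
    """
    seconds = elapsed_ms / 1000.0
    pages_per_s = pages / seconds
    # Assumption: "kb" is kilobits with 1 kb = 1024 bits. This is inferred
    # from the figures in the quoted status line, not from the Nutch code.
    kb_per_s = total_bytes * 8 / 1024.0 / seconds
    bytes_per_page = total_bytes / pages
    return pages_per_s, kb_per_s, bytes_per_page


# Totals from the quoted segment line: 400 pages, 69544345 bytes, 1308956 ms.
pages_per_s, kb_per_s, bytes_per_page = fetch_status(400, 69544345, 1308956)
print(f"{pages_per_s:.4f} pages/s, {kb_per_s:.2f} kb/s, "
      f"{bytes_per_page:.2f} bytes/page")
# prints: 0.3056 pages/s, 415.08 kb/s, 173860.86 bytes/page
```

Under that assumption the three values agree with the quoted status line
to the precision shown, so the pdf-only run moved roughly 170 KB per page
at about 0.3 pages/s.
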
By the way, how many threads did you use? In my tests, throughput was
best with the number of fetcher threads set to the number of CPUs, which
would be 2 in your case. The default in Nutch is 10, I think.

John

_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
