Re: [Nutch-dev] PDF parsing speed

2004-10-31 Thread John X
On Sun, Oct 31, 2004 at 03:43:22PM -0500, Luke Baker wrote:
> On 10/31/2004 12:22 PM, John X wrote:
> [snip]
> >What are the numbers for kb/s and bytes/page?
> >I have a collection of mostly mswords, ppts and some pdfs, the numbers are
> >041001 194517 10 status: 0.17712256 pages/s, 8246.524 kb/s,

Re: [Nutch-dev] PDF parsing speed

2004-10-31 Thread Luke Baker
On 10/31/2004 12:22 PM, John X wrote:
[snip]
What are the numbers for kb/s and bytes/page?
I have a collection of mostly mswords, ppts and some pdfs, the numbers are
041001 194517 10 status: 0.17712256 pages/s, 8246.524 kb/s, 5959461.5 bytes/page
Some files are very large in size: 100 - 300 Mbytes.
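
For anyone comparing their own numbers against the status line quoted above, here is a small sketch of how the three figures seem to relate. It assumes (judging only from the arithmetic, not from the Nutch source) that "kb/s" means kilobits per second with 1024 bits per kilobit. The function name parse_status and the regular expression are made up for this illustration and are not part of Nutch.

```python
import re

# Status line copied verbatim from the message above.
STATUS = ("041001 194517 10 status: 0.17712256 pages/s, "
          "8246.524 kb/s, 5959461.5 bytes/page")


def parse_status(line):
    """Pull (pages/s, kb/s, bytes/page) out of a status line.

    Hypothetical helper: the pattern is fitted to the one line quoted
    in this thread and may not match other log formats.
    """
    m = re.search(
        r"status:\s*([\d.]+) pages/s,\s*([\d.]+) kb/s,\s*([\d.]+) bytes/page",
        line,
    )
    if m is None:
        raise ValueError("not a recognised status line")
    return tuple(float(x) for x in m.groups())


pages_per_s, kb_per_s, bytes_per_page = parse_status(STATUS)

# Throughput in bytes per second follows from the other two figures.
bytes_per_s = pages_per_s * bytes_per_page

# Assumption: kb/s = bytes/s * 8 bits / 1024.
derived_kb_per_s = bytes_per_s * 8 / 1024

# Average time spent on one document.
seconds_per_page = 1 / pages_per_s

print(f"bytes/s:          {bytes_per_s:.1f}")
print(f"derived kb/s:     {derived_kb_per_s:.3f} (reported: {kb_per_s})")
print(f"seconds per page: {seconds_per_page:.2f}")
```

Under that assumption the derived kb/s agrees with the reported figure, so the low pages/s rate here mostly reflects the very large average document size (close to 6 Mbytes per page) rather than low byte throughput.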

Re: [Nutch-dev] PDF parsing speed

2004-10-31 Thread John X
On Sat, Oct 30, 2004 at 11:06:18AM -0400, Luke Baker wrote:
> Hey all,
>
> Does anyone else have the problem of the pdf parser taking up so many
> resources that it slows down the whole parsing process? I ran the fetch
> with the -noParsing option (thanks John!). I then ran the parser on the

[Nutch-dev] PDF parsing speed

2004-10-30 Thread Luke Baker
Hey all,

Does anyone else have the problem of the pdf parser taking up so many resources that it slows down the whole parsing process? I ran the fetch with the -noParsing option (thanks John!). I then ran the parser on the documents with the pdf parser enabled. The speed for parsing was quite