Re: [Nutch-dev] ArrayIndexOutOfBoundsException during fetch

2005-01-29 Thread Piotr Kosiorowski
Hello, I am attaching the patch for RegexUrlNormalizer and RegexUrlFilter - it should reduce synchronization of threads during fetching. It took me quite a long time to do, as I was quite busy at work, but I finally did it - I tested it by downloading about 1 million URLs in 200 fetcher threads and it was run

[Nutch-dev] indexing and updating the db is very slow

2005-01-29 Thread Brandon Purcell
I recently moved to 0.6 and for some reason everything is running much slower when I update and index the db. I recently merged all of my segments into one (there were about 12) and they contain approximately 630,000 total pages. I ran a fetch of 100,000 more pages and when I went to update the

[Nutch-dev] need a lib to know location of html element

2005-01-29 Thread John X
Hi all, I need a lib/tool that can tell me the physical location of a particular HTML element as the page would be displayed by a browser. It would be best if it were in Java, but I am open to ones in other languages. Commercial or not, either is fine. Any help/info/recommendation is greatly appreci

Re: [Nutch-dev] get unparsed content

2005-01-29 Thread Stefan Groschupf
John, thanks a lot for this clarification! Cheers, Stefan On 30.01.2005 at 00:45, John X wrote: On Sat, Jan 29, 2005 at 11:09:05PM +0100, Stefan Groschupf wrote: Hi there, I would love to use the raw content of a fetched page for some post processing. After browsing the code I'm a little bit confu

Re: [Nutch-dev] get unparsed content

2005-01-29 Thread John X
On Sat, Jan 29, 2005 at 11:09:05PM +0100, Stefan Groschupf wrote:
> Hi there,
>
> I would love to use the raw content of a fetched page for some post processing.
> After browsing the code I'm a little bit confused. :-)
> The raw content of a page will be stored in any case in a content array
>

[Nutch-dev] get unparsed content

2005-01-29 Thread Stefan Groschupf
Hi there, I would love to use the raw content of a fetched page for some post processing. After browsing the code I'm a little bit confused. :-) The raw content of a page will be stored in any case in a content array file, won't it? Now I'm wondering what is stored in 'fetcher_output', since the