[Nutch-dev] Bad URLs causing SEVERE exception

2005-07-05 Thread Chirag Chaman
Over the weekend the fetcher crashed and kept crashing. The culprit was a site that pointed to bad links such as http://:80/ and http://:0/. These links were getting through the URL filter, so we changed the filter to accept only valid URLs. As someone else may face the same issue, here is the RE -- t
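
The regular expression from this message is cut off in the archive preview and is not reproduced here. As a rough illustration of the same idea (rejecting links that have no host, such as http://:80/), here is a minimal sketch that uses java.net.URI instead of a regular expression. The class and method names (HostCheck, hasUsableHost) are hypothetical and are not part of Nutch or its URL filter plugins.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not Nutch code: accepts only http/https URLs
// that parse cleanly and carry a non-empty host.
final class HostCheck {
    static boolean hasUsableHost(String url) {
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            // getHost() is null when the authority has no server-based host,
            // which is the case for inputs like http://:80/
            String host = uri.getHost();
            boolean httpLike = scheme != null
                && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"));
            return httpLike && host != null && !host.isEmpty();
        } catch (URISyntaxException e) {
            // Anything that does not parse as a URI is treated as invalid.
            return false;
        }
    }
}
```

A check like this would have to run before a link reaches the fetcher. Note that java.net.URI is strict about host names, so it may also reject some hosts that browsers tolerate.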

[Nutch-dev] RE: [jira] Commented: (NUTCH-66) Cookies are not being read properly

2005-07-05 Thread Chirag Chaman
Andrzej, This does NOT work. Still complains when it sees the domain name without a leading period. -Original Message- From: Andrzej Bialecki (JIRA) [mailto:[EMAIL PROTECTED] Sent: Monday, July 04, 2005 12:57 PM To: nutch-dev@incubator.apache.org Subject: [jira] Commented: (NUTCH-66)

[Nutch-dev] RE: both html parser have bug with javascript

2005-07-05 Thread Chirag Chaman
Andrzej, Thanks -- This works!!! -Original Message- From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] Sent: Monday, July 04, 2005 11:55 AM To: nutch-dev@lucene.apache.org Subject: Re: both html parser have bug with javascript Chirag Chaman wrote: > Andrzej, > > Thank you -- and here we

[Nutch-dev] Re: Iterating spidered pages

2005-07-05 Thread Andrzej Bialecki
Andy Liu wrote: However, somebody correct me if I'm wrong, but I don't think you can update individual ArrayFile entries once they've been written. So while you're looping over each ParseData entry, you can write your updated ParseData objects to a temporary ArrayFile and replace it with the ol

[Nutch-dev] Re: LanguageIdentifier refactoring

2005-07-05 Thread Andrzej Bialecki
Jérôme Charron wrote: I think, this is an issue for all detection mechanisms... For the content-type it is the same problem: What is the right value, the one provided by the protocol layer, or the one provided by the extension mapping, or the one provided by the detection (mime-magic)? I thi

[Nutch-dev] Re: Iterating spidered pages

2005-07-05 Thread Andy Liu
You can use a SegmentReader object to give you references to the FetcherOutput, ParseData, and Content objects for each page in the segment. The raw page data is encapsulated within the Content object so you can parse out whatever you want from it. However, somebody correct me if I'm wrong, but I

[Nutch-dev] Re: LanguageIdentifier refactoring

2005-07-05 Thread Jérôme Charron
> I have an issue with the language detection plugin, which I'm not sure > how to address. The plugin first tries to extract the language > identifier from meta tags. However, meta tag values people put there are > often completely wrong, or follow obscure pseudo-standards. > > Example: there is a

[Nutch-dev] Re: LanguageIdentifier refactoring

2005-07-05 Thread Andrzej Bialecki
Jerome, I have an issue with the language detection plugin, which I'm not sure how to address. The plugin first tries to extract the language identifier from meta tags. However, meta tag values people put there are often completely wrong, or follow obscure pseudo-standards. Example: there i

[Nutch-dev] Iterating spidered pages

2005-07-05 Thread Fredrik Andersson
Hi! I'm new to this list, so hello to you all. Here's the gig - I have crawled and indexed a bunch of pages. The HTML parser used in Nutch only parses out the title, text, metadata and outlinks. Is there any way to extend this set of attributes post-crawling (i.e., without rewriting HtmlParser.jav

[Nutch-dev] [jira] Updated: (NUTCH-68) A tool to generate arbitrary fetchlists

2005-07-05 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-68?page=all ] Andrzej Bialecki updated NUTCH-68: --- Attachment: FreeFetchlistTool.java > A tool to generate arbitrary fetchlists > --- > > Key: NUTCH-68 >

[Nutch-dev] [jira] Created: (NUTCH-68) A tool to generate arbitrary fetchlists

2005-07-05 Thread Andrzej Bialecki (JIRA)
A tool to generate arbitrary fetchlists
Key: NUTCH-68
URL: http://issues.apache.org/jira/browse/NUTCH-68
Project: Nutch
Type: New Feature
Components: fetcher
Reporter: Andrzej Bialecki
Assigned to: Andrzej Bialecki