Over the weekend the fetcher crashed and kept crashing. The culprit was a
site that pointed to bad links -- http://:80/ and http://:0/ etc.
These links were getting through, so we changed the URL filter to accept
only valid URLs.
In case someone else faces the same issue, here is the RE -- t
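The regular expression itself is cut off above, so it cannot be reproduced
here. As a minimal sketch of the same idea (not the actual Nutch filter),
rejecting URLs with an empty host catches links like http://:80/ and
http://:0/, which java.net.URL happily parses without complaint:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlHostCheck {
    // Illustrative check, not the committed Nutch filter regex (which was
    // truncated in the original message): require a non-empty host.
    static boolean isValid(String spec) {
        try {
            URL u = new URL(spec);
            return u.getHost() != null && !u.getHost().isEmpty();
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("http://:80/"));               // false
        System.out.println(isValid("http://lucene.apache.org/")); // true
    }
}
```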
Andrzej,
This does NOT work.
Still complains when it sees the domain name without a leading period.
-----Original Message-----
From: Andrzej Bialecki (JIRA) [mailto:[EMAIL PROTECTED]]
Sent: Monday, July 04, 2005 12:57 PM
To: nutch-dev@incubator.apache.org
Subject: [jira] Commented: (NUTCH-66)
Andrzej,
Thankx -- This works!!!
-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]]
Sent: Monday, July 04, 2005 11:55 AM
To: nutch-dev@lucene.apache.org
Subject: Re: both html parser have bug with javascript
Chirag Chaman wrote:
> Andrzej,
>
> Thank you -- and here we
Andy Liu wrote:
However, somebody correct me if I'm wrong, but I don't think you can
update individual ArrayFile entries once they've been written. So
while you're looping over each ParseData entry, you can write your
updated ParseData objects to a temporary ArrayFile and replace it with
the ol
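The message is truncated, but the pattern Andy describes is the standard
rewrite-and-swap: since entries cannot be updated in place, stream every
record through your update into a temporary file, then replace the
original. A hedged stand-in using plain line-oriented files instead of
Nutch's ArrayFile (the class and method names below are illustrative, not
Nutch API):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.UnaryOperator;

public class RewriteAndSwap {
    // Copy every entry through `update` into a temp file, then replace
    // the original. Plain text lines stand in for ArrayFile records here.
    static void rewrite(Path file, UnaryOperator<String> update) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        try (BufferedReader in = Files.newBufferedReader(file);
             BufferedWriter out = Files.newBufferedWriter(tmp)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(update.apply(line));
                out.newLine();
            }
        }
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

The swap via Files.move keeps readers from ever seeing a half-written
file; on POSIX filesystems a same-directory rename is atomic.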
Jérôme Charron wrote:
I think, this is an issue for all detection mechanisms...
For the content-type it is the same problem: What is the right value, the
one provided by the protocol layer, or the one provided by the extension
mapping, or the one provided by the detection (mime-magic)?
I thi
You can use a SegmentReader object to give you references to the
FetcherOutput, ParseData, and Content objects for each page in the
segment. The raw page data is encapsulated within the Content object
so you can parse out whatever you want from it.
However, somebody correct me if I'm wrong, but I
> I have an issue with the language detection plugin, which I'm not sure
> how to address. The plugin first tries to extract the language
> identifier from meta tags. However, meta tag values people put there are
> often completely wrong, or follow obscure pseudo-standards.
>
> Example: there is a
Jerome,
I have an issue with the language detection plugin, which I'm not sure
how to address. The plugin first tries to extract the language
identifier from meta tags. However, meta tag values people put there are
often completely wrong, or follow obscure pseudo-standards.
Example: there i
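The example here is cut off, but the extraction step Andrzej mentions can
be sketched as follows. This is a hedged illustration, not the plugin's
actual code: it checks the two places pages commonly declare a language
and treats the result only as a hint, since (as noted) the declared
values are often wrong:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaLanguage {
    // Two common declaration points: <meta http-equiv="Content-Language">
    // and the lang attribute on <html>. Regexes are a rough sketch; a
    // real parser plugin would walk the DOM instead.
    private static final Pattern META = Pattern.compile(
        "<meta[^>]+http-equiv=[\"']?content-language[\"']?[^>]+content=[\"']?([a-zA-Z-]+)",
        Pattern.CASE_INSENSITIVE);
    private static final Pattern HTML_LANG = Pattern.compile(
        "<html[^>]+lang=[\"']?([a-zA-Z-]+)", Pattern.CASE_INSENSITIVE);

    static Optional<String> declaredLanguage(String html) {
        for (Pattern p : new Pattern[]{META, HTML_LANG}) {
            Matcher m = p.matcher(html);
            if (m.find()) {
                return Optional.of(m.group(1).toLowerCase());
            }
        }
        return Optional.empty(); // no declaration -- fall back to detection
    }
}
```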
Hi!
I'm new to this list, so hello to you all.
Here's the gig - I have crawled and indexed a bunch of pages. The HTML
Parser used in nutch only parses out the title, text, metadata and
outlinks. Is there any way to extend this set of attributes
post-crawling (i.e, without rewriting HtmlParser.jav
[ http://issues.apache.org/jira/browse/NUTCH-68?page=all ]
Andrzej Bialecki updated NUTCH-68:
---
Attachment: FreeFetchlistTool.java
> A tool to generate arbitrary fetchlists
> ---
>
> Key: NUTCH-68
>
A tool to generate arbitrary fetchlists
---
Key: NUTCH-68
URL: http://issues.apache.org/jira/browse/NUTCH-68
Project: Nutch
Type: New Feature
Components: fetcher
Reporter: Andrzej Bialecki
Assigned to: Andrzej Bialecki