Hi,
I'm converting Nutch into a focused crawler. I am looking at the
following file:
FetchListTool.java. I can find where the entries get updated in
the db (line 558), but where is the page actually fetched by the
crawler?
Could anyone help me out? Am I looking in a completely wrong place?
Ditto!
-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Erik
Hatcher
Sent: Thursday, April 21, 2005 5:01 PM
To: nutch-dev@incubator.apache.org
Subject: [Nutch-dev] Re: [EMAIL PROTECTED] Mailinglist
I'm getting multiple messages to the list. I'm not showi
[ http://issues.apache.org/jira/browse/NUTCH-46?page=comments#action_63458 ]
zhangjin commented on NUTCH-46:
---
I see what you mean. I think Nutch works well on Linux, but I am using
it in a Windows 2000 environment. My code is shown below.
publi
I'm getting multiple messages to the list. I'm not showing as
subscribed to the sourceforge list, but I get 3 copies of each Nutch
message. I need to get that straightened out sometime.
Erik
On Apr 20, 2005, at 1:07 PM, Doug Cutting wrote:
Michael Wechner wrote:
Sorry if this might be
On 21/04/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> If someone can convince the developers to release this under an
> acceptable license (Apache, BSD, Artistic, MIT/X, MIT/W3C, MPL 1.1,
> etc.) then we can include it in Nutch at Apache.
I cannot locate the RTF parser's library dependency either
[EMAIL PROTECTED] wrote:
I now understand the solution to the 'deply same pages' problem reported
to JIRA
(like:http://www.nb1.hu/galeria/Hun_Ita/reti/m/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/
[ http://issues.apache.org/jira/browse/NUTCH-13?page=comments#action_63418 ]
byron miller commented on NUTCH-13:
---
If we want to support IPs, let's do it both ways.
Banned list:
ipdeny.txt or something similar that contains an IP address range/subnet
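For illustration, here is one way entries in such a banned list could be matched, as a minimal sketch assuming CIDR-style `a.b.c.d/prefix` entries. The class and method names are made up for this example and are not Nutch code.

```java
import java.net.InetAddress;

// Sketch: match an IPv4 address against one CIDR entry ("a.b.c.d/prefix"),
// as a banned list like the proposed ipdeny.txt might contain.
public class CidrMatch {
    public static boolean matches(String ip, String cidr) {
        try {
            String[] parts = cidr.split("/");
            int prefix = Integer.parseInt(parts[1]);
            int addr = toInt(InetAddress.getByName(ip).getAddress());
            int net  = toInt(InetAddress.getByName(parts[0]).getAddress());
            // Build the network mask from the prefix length.
            int mask = prefix == 0 ? 0 : -1 << (32 - prefix);
            return (addr & mask) == (net & mask);
        } catch (Exception e) {
            // Malformed entry or address: treat as no match.
            return false;
        }
    }

    // Pack four address bytes into one int for masking.
    private static int toInt(byte[] b) {
        return ((b[0] & 0xff) << 24) | ((b[1] & 0xff) << 16)
             | ((b[2] & 0xff) << 8)  |  (b[3] & 0xff);
    }
}
```

A fetcher could run this check once per resolved host and cache the result, since the list is static between crawls.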
[ http://issues.apache.org/jira/browse/NUTCH-48?page=all ]
Andy Liu updated NUTCH-48:
--
Attachment: spell-check.patch
run this command:
bin/nutch org.apache.nutch.spell.NGramSpeller -i [main index] -o [output
spelling index] -f content -minThreshold 500
to ge
Alan Wang wrote:
String lastModified = metaData.getProperty("last-modified");
if (lastModified == null)
return doc;
If the metaData does not contain a "last-modified" entry (from the http
headers) then the document ends up with no last-modified field, and
hence nothing to sort it on
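One possible fix for the missing field, sketched with assumed names (the method and the `fetchTimeMillis` parameter are hypothetical, not Nutch's actual indexing API): fall back to the fetch time when the server sent no Last-Modified header, so the field always exists and remains sortable.

```java
// Sketch: default the last-modified value to the fetch time when the
// "last-modified" HTTP header is absent. Names are illustrative only.
public class LastModifiedFallback {
    public static String lastModifiedOrDefault(java.util.Properties metaData,
                                               long fetchTimeMillis) {
        String lastModified = metaData.getProperty("last-modified");
        if (lastModified == null) {
            // No header from the server: use the fetch time as a
            // conservative, always-present default.
            return Long.toString(fetchTimeMillis);
        }
        return lastModified;
    }
}
```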
[ http://issues.apache.org/jira/browse/NUTCH-13?page=comments#action_63416 ]
Matthias Jaekle commented on NUTCH-13:
--
If the fetcher is the only task that already runs the DNS lookup, it might be the
best place to implement the IP filter, to avoid
Hasan Diwan wrote:
The jar file required by this plugin is missing from the repository.
The problem is that, as far as I can tell, the license for this software
does not permit it to be re-distributed with Apache software.
I believe this software is available under LGPL. That's what the
Source
[ http://issues.apache.org/jira/browse/NUTCH-13?page=comments#action_63403 ]
Andrzej Bialecki commented on NUTCH-13:
Let's not be too hasty... There are legitimate cases when numeric IPs, even
from the private address-spaces are appropriate an
[ http://issues.apache.org/jira/browse/NUTCH-13?page=comments#action_63395 ]
byron miller commented on NUTCH-13:
---
Would it make sense to ignore all IP-based URLs? Typically for me, IP URLs are
short lived: mirror servers, load-balanced sites, proxy h
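As an illustration of the check being proposed, a filter could detect URLs whose host part is a dotted-quad IPv4 address. This sketch is not Nutch's URLFilter API; the class name is made up.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Pattern;

// Sketch: detect URLs with a numeric IPv4 host, which an IP-ignoring
// URL filter could then reject. Illustrative only.
public class IpUrlCheck {
    // Dotted quad: four 1-3 digit groups separated by dots.
    private static final Pattern IPV4 =
        Pattern.compile("^\\d{1,3}(\\.\\d{1,3}){3}$");

    public static boolean hasIpHost(String url) {
        try {
            return IPV4.matcher(new URL(url).getHost()).matches();
        } catch (MalformedURLException e) {
            // Unparseable URLs are handled elsewhere; not our concern here.
            return false;
        }
    }
}
```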
[ http://issues.apache.org/jira/browse/NUTCH-39?page=comments#action_63396 ]
byron miller commented on NUTCH-39:
---
Here is a nice taglib to do pagination. I'm not sure about the possible
performance hits yet; I use code similar to the one posted here.
[ http://issues.apache.org/jira/browse/NUTCH-13?page=comments#action_63397 ]
Matthias Jaekle commented on NUTCH-13:
--
Yes. But to solve this problem you have to ignore all URLs pointing to IPs
starting with 127. For example: www.tik24.de points to
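The 127.0.0.0/8 case can be checked directly on the resolved address. A minimal sketch (the class is made up; in a crawler this would run on the result of the DNS lookup rather than on a literal):

```java
import java.net.InetAddress;

// Sketch: reject addresses in 127.0.0.0/8 after DNS resolution.
public class LoopbackCheck {
    public static boolean isLoopback(String ipLiteral) {
        try {
            // getByName on a literal parses it without a DNS query.
            return InetAddress.getByName(ipLiteral).isLoopbackAddress();
        } catch (Exception e) {
            return false;
        }
    }
}
```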
[ http://issues.apache.org/jira/browse/NUTCH-48?page=comments#action_63390 ]
Andy Liu commented on NUTCH-48:
---
I have implemented a rough version of this feature using David Spencer's code.
I will submit a patch when I get the chance.
> "Did you mean"
Flag for generate to fetch only new pages to complement the -refetchonly flag
-
Key: NUTCH-49
URL: http://issues.apache.org/jira/browse/NUTCH-49
Project: Nutch
Type: New Feature
Components:
[ http://issues.apache.org/jira/browse/NUTCH-49?page=all ]
Luke Baker updated NUTCH-49:
Attachment: fetchnewonly.patch
Attached is a patch that provides this functionality to the FetchListTool
(generate).
> Flag for generate to fetch only new pages to comp
Dear Doug,
I now understand the solution to the 'deply same pages' problem reported
to JIRA
(like:http://www.nb1.hu/galeria/Hun_Ita/reti/m/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2
> Can you please suggest how to go about implementing this? I would like
> to add this check.
In the HttpResponse class, just add something like (it uses the
If-Modified-Since header, not the HEAD method) :
reqStr.append("If-Modified-Since: ");
reqStr.append(TheDateToCheck);
reqStr.append("\r\n");
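Note that HTTP requires the If-Modified-Since value to be an RFC 1123 date in GMT, so `TheDateToCheck` would need formatting along these lines. This helper is an illustration, not part of Nutch:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Sketch: format a timestamp as an RFC 1123 HTTP date in GMT,
// suitable for an If-Modified-Since header.
public class HttpDate {
    public static String format(long millis) {
        SimpleDateFormat fmt =
            new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(new Date(millis));
    }
}
```

Locale.US matters here: day and month names in HTTP dates must be English regardless of the default locale.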
>
> The bigger issue, however, is how you deal with causing the byte sequence
> (or so called "magic characters") in the mime types configuration file to
> recognize that a file is in fact an RSS file. With so many different types
> of valid feeds (RSS 2.0, 0.9, 1.0, ATOM, and its many versions),
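Since no single magic byte sequence covers every feed flavor, one fallback is to sniff the first bytes of content for feed root elements. A rough sketch, not Nutch's mime-type detection code:

```java
// Sketch: a crude content sniff for RSS/Atom/RDF feeds, checking the
// document head for known feed root elements. Illustrative only.
public class FeedSniff {
    public static boolean looksLikeFeed(String head) {
        String s = head.trim().toLowerCase(java.util.Locale.ROOT);
        // Cover RSS 0.9x/2.0, RSS 1.0 (RDF), and Atom.
        return s.contains("<rss") || s.contains("<rdf:rdf") || s.contains("<feed");
    }
}
```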
That's good, thanks.
2005/4/21, Alan Wang <[EMAIL PROTECTED]>:
>
> Thanks.
>
> I am sorry; I thought the message was not sent, so I resent it. :(
> And I am sorry that I did not describe it clearly.
>
> The two items that Doug mentioned are not the source of this problem
> because I have alre
Hi Doug,
Is anyone working on this issue? If not, I will go ahead.
I suppose it is not hard to support "indexing locally and searching
remotely".
A simple way to implement this would be to change the protocol-file
plugin to handle http urls (add protocol-name="http" in plugin.xml),
then modify Fi