[Nutch-dev] Looking for crawler

2005-04-21 Thread rajat swarup
Hi, I'm converting Nutch into a focused crawler. I am looking at the following files: FetchListTool.java. I am able to find where the files get updated into the db (line 558) but where is the page actually fetched by the crawler? Could anyone help me out? Am I looking in a completely wrong place?

RE: [Nutch-dev] Re: [EMAIL PROTECTED] Mailinglist

2005-04-21 Thread Chirag Chaman
Ditto! -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Erik Hatcher Sent: Thursday, April 21, 2005 5:01 PM To: nutch-dev@incubator.apache.org Subject: [Nutch-dev] Re: [EMAIL PROTECTED] Mailinglist I'm getting multiple messages to the list. I'm not showi

[Nutch-dev] [jira] Commented: (NUTCH-46) the NDFS problem(Could not obtain new output block for file)

2005-04-21 Thread zhangjin (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-46?page=comments#action_63458 ] zhangjin commented on NUTCH-46: --- I understand what you mean. I think Nutch works well on Linux, but I am using it in a Windows 2000 environment. My code is shown below. publi

[Nutch-dev] Re: [EMAIL PROTECTED] Mailinglist

2005-04-21 Thread Erik Hatcher
I'm getting multiple messages to the list. I'm not showing as subscribed to the sourceforge list, but I get 3 copies of each Nutch message. I need to get that straightened out sometime. Erik On Apr 20, 2005, at 1:07 PM, Doug Cutting wrote: Michael Wechner wrote: Sorry if this might be

Re: [Nutch-dev] Re: parse-mp3 dependency missing

2005-04-21 Thread Hasan Diwan
On 21/04/05, Doug Cutting <[EMAIL PROTECTED]> wrote: > If someone can convince the developers to release this under an > acceptable license (Apache, BSD, Artistic, MIT/X, MIT/W3C, MPL 1.1, > etc.) then we can include it in Nutch at Apache. I cannot locate the RTF parser's library dependency either

Re: [Nutch-dev] [jira] Commented: (NUTCH-7) please update it with the svn

2005-04-21 Thread Doug Cutting
[EMAIL PROTECTED] wrote: I now understand the solution to the 'deeply same pages' problem reported to JIRA (like: http://www.nb1.hu/galeria/Hun_Ita/reti/m/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/

[Nutch-dev] [jira] Commented: (NUTCH-13) If dns points to 127.0.0.1, the url is also crawled

2005-04-21 Thread byron miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-13?page=comments#action_63418 ] byron miller commented on NUTCH-13: --- If we want to support IPs, let's do it both ways. Banned list: ipdeny.txt or something similar that contains an IP address range/subnet

[Nutch-dev] [jira] Updated: (NUTCH-48) "Did you mean" query enhancement/refignment feature request

2005-04-21 Thread Andy Liu (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-48?page=all ] Andy Liu updated NUTCH-48: -- Attachment: spell-check.patch run this command: bin/nutch org.apache.nutch.spell.NGramSpeller -i [main index] -o [output spelling index] -f content -minThreshold 500 to ge

Re: [Nutch-dev] Re: Sort does not work properly

2005-04-21 Thread Doug Cutting
Alan Wang wrote: String lastModified = metaData.getProperty("last-modified"); if (lastModified == null) return doc; If the metaData does not contain a "last-modified" entry (from the http headers) then the document ends up with no last-modified field, and hence nothing to sort it on
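The missing-field case described above can be illustrated with a small sketch. This is not Nutch code and not necessarily the fix proposed in the thread: the class `LastModifiedSketch`, the method `sortableTime`, and the idea of falling back to the fetch time are all assumptions made for illustration. It only shows one way to guarantee that every document ends up with a value to sort on.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Properties;

// Hypothetical helper, not part of Nutch.
final class LastModifiedSketch {

    // Returns epoch milliseconds to sort on. Falls back to the (assumed)
    // fetch time when the "last-modified" entry is absent or unparseable,
    // so that no document is left without a sortable value.
    static long sortableTime(Properties metaData, long fetchTimeMillis) {
        String lastModified = metaData.getProperty("last-modified");
        if (lastModified == null) {
            return fetchTimeMillis;
        }
        try {
            // HTTP dates such as "Thu, 21 Apr 2005 17:01:00 GMT" follow RFC 1123.
            return ZonedDateTime
                    .parse(lastModified, DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant()
                    .toEpochMilli();
        } catch (DateTimeParseException e) {
            return fetchTimeMillis;
        }
    }

    public static void main(String[] args) {
        Properties metaData = new Properties();
        metaData.setProperty("last-modified", "Thu, 21 Apr 2005 17:01:00 GMT");
        System.out.println(sortableTime(metaData, System.currentTimeMillis()));
        System.out.println(sortableTime(new Properties(), 0L));
    }
}
```

Whether the fetch time is an acceptable substitute depends on how the sort is meant to behave; treating such documents as oldest or newest would be equally simple variations.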

[Nutch-dev] [jira] Commented: (NUTCH-13) If dns points to 127.0.0.1, the url is also crawled

2005-04-21 Thread Matthias Jaekle (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-13?page=comments#action_63416 ] Matthias Jaekle commented on NUTCH-13: -- If fetcher is the only task which already runs the dns-lookup it might be the best place to implement the ip filter there to avoid

[Nutch-dev] Re: parse-mp3 dependency missing

2005-04-21 Thread Doug Cutting
Hasan Diwan wrote: The jar file required by this plugin is missing from the repository. The problem is that, as far as I can tell, the license for this software does not permit it to be re-distributed with Apache software. I believe this software is available under LGPL. That's what the Source

[Nutch-dev] [jira] Commented: (NUTCH-13) If dns points to 127.0.0.1, the url is also crawled

2005-04-21 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-13?page=comments#action_63403 ] Andrzej Bialecki commented on NUTCH-13: Let's not be too hasty... There are legitimate cases when numeric IPs, even from the private address-spaces are appropriate an

[Nutch-dev] [jira] Commented: (NUTCH-13) If dns points to 127.0.0.1, the url is also crawled

2005-04-21 Thread byron miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-13?page=comments#action_63395 ] byron miller commented on NUTCH-13: --- Would it make sense to ignore all IP based URLs? Typically for me IP urls are short lived, mirror servers, load balanced sites, proxy h

[Nutch-dev] [jira] Commented: (NUTCH-39) pagination in search result

2005-04-21 Thread byron miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-39?page=comments#action_63396 ] byron miller commented on NUTCH-39: --- Here is a nice taglib to do pagination. I'm not sure about the possible performance hits yet; I use code similar to the one posted here.

[Nutch-dev] [jira] Commented: (NUTCH-13) If dns points to 127.0.0.1, the url is also crawled

2005-04-21 Thread Matthias Jaekle (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-13?page=comments#action_63397 ] Matthias Jaekle commented on NUTCH-13: -- Yes. But to solve this problem you have to ignore all urls pointing to IPs starting with 127. For example: www.tik24.de points to
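A check of the kind discussed in this NUTCH-13 thread might look roughly like the sketch below. It is not the Nutch fetcher code: the class `LoopbackFilterSketch` and the method `resolvesToLoopback` are hypothetical names. It relies only on the standard `java.net.InetAddress` API and rejects loopback addresses (127.x.x.x and ::1) only, leaving private address ranges alone, since another comment in the thread notes those can be legitimate.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of Nutch.
final class LoopbackFilterSketch {

    // Returns true if any address the host resolves to is a loopback
    // address. Unresolvable hosts are reported as "not loopback" here;
    // a real fetcher would presumably handle that failure separately.
    static boolean resolvesToLoopback(String host) {
        try {
            for (InetAddress addr : InetAddress.getAllByName(host)) {
                if (addr.isLoopbackAddress()) {
                    return true;
                }
            }
            return false;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // IP literals are parsed directly, so no DNS query is made here.
        System.out.println(resolvesToLoopback("127.0.0.1"));
        System.out.println(resolvesToLoopback("192.0.2.1"));
    }
}
```

Because the check needs a DNS lookup for host names, it fits most naturally wherever that lookup already happens, which matches the suggestion elsewhere in the thread to do the filtering in the fetcher.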

[Nutch-dev] [jira] Commented: (NUTCH-48) "Did you mean" query enhancement/refignment feature request

2005-04-21 Thread Andy Liu (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-48?page=comments#action_63390 ] Andy Liu commented on NUTCH-48: --- I have implemented a rough version of this feature using David Spencer's code. I will submit a patch when I get the chance. > "Did you mean"

[Nutch-dev] [jira] Created: (NUTCH-49) Flag for generate to fetch only new pages to complement the -refetchonly flag

2005-04-21 Thread Luke Baker (JIRA)
Flag for generate to fetch only new pages to complement the -refetchonly flag - Key: NUTCH-49 URL: http://issues.apache.org/jira/browse/NUTCH-49 Project: Nutch Type: New Feature Components:

[Nutch-dev] [jira] Updated: (NUTCH-49) Flag for generate to fetch only new pages to complement the -refetchonly flag

2005-04-21 Thread Luke Baker (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-49?page=all ] Luke Baker updated NUTCH-49: Attachment: fetchnewonly.patch Attached is a patch that provides this functionality to the FetchListTool (generate). > Flag for generate to fetch only new pages to comp

Re: [Nutch-dev] [jira] Commented: (NUTCH-7) please update it with the svn

2005-04-21 Thread [EMAIL PROTECTED]
Dear Doug, I now understand the solution to the 'deeply same pages' problem reported to JIRA (like: http://www.nb1.hu/galeria/Hun_Ita/reti/m/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2

[Nutch-dev] Re: Incremental Crawling

2005-04-21 Thread Jérôme Charron
> Can you please suggest how to go about implementing this? I would like > to add this check. In the HttpResponse class, just add something like (it uses the If-Modified-Since header, not the HEAD method) : reqStr.append("If-Modified-Since: "); reqStr.append(TheDateToCheck); reqStr.append("\r\n"
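For readers unfamiliar with conditional requests, the following standalone sketch shows how an `If-Modified-Since` header could be appended to a raw HTTP request string. It is not the Nutch `HttpResponse` class: `ConditionalGetSketch`, `buildRequest`, and the host and path values are hypothetical, and the rest of the request line is assumed for illustration. The date is formatted in the fixed-length HTTP date form with a two-digit day.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, not part of Nutch.
final class ConditionalGetSketch {

    // HTTP date format, e.g. "Thu, 21 Apr 2005 17:01:00 GMT".
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                    .withZone(ZoneOffset.UTC);

    // Builds a GET request that asks the server to send the body only
    // if the resource changed after lastFetched.
    static String buildRequest(String host, String path, Instant lastFetched) {
        StringBuilder reqStr = new StringBuilder();
        reqStr.append("GET ").append(path).append(" HTTP/1.0\r\n");
        reqStr.append("Host: ").append(host).append("\r\n");
        reqStr.append("If-Modified-Since: ");
        reqStr.append(HTTP_DATE.format(lastFetched));
        reqStr.append("\r\n");
        reqStr.append("\r\n"); // blank line ends the header section
        return reqStr.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildRequest("example.org", "/index.html",
                Instant.parse("2005-04-21T17:01:00Z")));
    }
}
```

A server that honors the header answers with status 304 (Not Modified) and no body when the page is unchanged, so the caller also needs to treat 304 as "keep the stored copy" rather than as an error.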

[Nutch-dev] Re: parse-rss fetch problems

2005-04-21 Thread Jérôme Charron
> > The bigger issue, however, is how you deal with causing the byte sequence > (or so called "magic characters") in the mime types configuration file to > recognize that a file is in fact an RSS file. With so many different types > of valid feeds (RSS 2.0, 0.9, 1.0, ATOM, and its many versions),

Re: [Nutch-dev] Re: Sort does not work properly

2005-04-21 Thread zhang jin
That's good, thanks. 2005/4/21, Alan Wang <[EMAIL PROTECTED]>: > > Thanks. > > I am sorry that I thought the message was not sent and resent it. :(. > And I am sorry that I did not describe it clearly. > > The two items that Doug mentioned are not the source of this problem > because I have alre

Re: [Nutch-dev] filesystem indexing

2005-04-21 Thread Boris Kröger
Hi Doug, Is anyone working on this issue? If not, I will go ahead. I suppose it is not hard to support "indexing locally and searching remotely". A simple way to implement this would be to change the protocol-file plugin to handle http urls (add protocol-name="http" in plugin.xml), then modify Fi