[Nutch-dev] [jira] Updated: (NUTCH-54) Fetcher improvements

2005-05-17 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-54?page=all ] Andrzej Bialecki updated NUTCH-54: --- Attachment: 20050518.patch new-plugins.zip Updated patch. Fixed problems with redirection handling, improved support for JavaScript and for

[Nutch-dev] Re: tools cleanup

2005-05-17 Thread Doug Cutting
Sami Siren wrote: should we introduce a new package for these: NutchConfigurable, NutchConfigured and the upcoming action classes - I've added these in util in the mapred branch and will use them as I rewrite tools to use MapReduce. I'll commit them soon. Doug --

[Nutch-dev] SEVERE error: key out of order

2005-05-17 Thread Andrzej Bialecki
Hi, when running the Fetcher, it sometimes dies with the following exception:
050517 212854 SEVERE error writing output:java.io.IOException: key out of order: 3420 after 3420
java.io.IOException: key out of order: 3420 after 3420
at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:128)
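For readers unfamiliar with this kind of failure, here is a minimal Python sketch of an append-only writer that insists on strictly increasing keys. It is not Nutch's Java code: the class `OrderedWriter` and its `append` method are hypothetical names, and the assumption that a key equal to the previous one is rejected is inferred only from the message "3420 after 3420" in the log above.

```python
class OrderedWriter:
    """Toy append-only writer (hypothetical, not the Nutch MapFile API)."""

    def __init__(self):
        self.entries = []      # (key, value) pairs accepted so far
        self.last_key = None   # most recently accepted key

    def append(self, key, value):
        # Assumption: keys must be strictly increasing, so a repeated key
        # fails the same way as a smaller one.
        if self.last_key is not None and key <= self.last_key:
            raise IOError(f"key out of order: {key} after {self.last_key}")
        self.entries.append((key, value))
        self.last_key = key


if __name__ == "__main__":
    writer = OrderedWriter()
    writer.append(3419, "first")
    writer.append(3420, "second")
    try:
        writer.append(3420, "duplicate")  # same key appended twice
    except IOError as err:
        print(err)
```

Under that assumption, a duplicate key in the writer's input would be enough to trigger the error, without any key actually arriving out of sequence.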

[Nutch-dev] IOException in link analysis with ndfs-based web db

2005-05-17 Thread Pablo Mayrgundter
Exception in thread "main" java.io.IOException: Could not obtain new output block for file /db/linkstats.txt at net.nutch.ndfs.NDFSClient$NameNodeCaller.getNewOutputBlock(NDFSClient.java:907) This file already exists, so perhaps it needs to be deleted first? Would appreciate any pointers

[Nutch-dev] Re: Update: HTTPClient for protocol-http and protocol-https

2005-05-17 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: You can download the patch from here: http://www.getopt.org/nutch/20050507.patch I have not yet had a chance to try this. Following are some quick comments from reading the patch. Overall I think this is great stuff. 1. Why does an HTMLMetaTags n

[Nutch-dev] Protocol-http - problematic behaviour of the address blocking routine

2005-05-17 Thread Andrzej Bialecki
Hi, This is just an observation and a warning for those of you who are crawling single sites in depth and have encountered frequent "Exceeded http.max.delays" exceptions. Assume the following scenario: a user runs the CrawlTool to crawl a single site. Fetchlists generated by the CrawlTool will conta

[Nutch-dev] Re: NDFS Questions

2005-05-17 Thread Doug Cutting
Pablo Mayrgundter wrote: I'm testing a deployment of Nutch at work and am trying to decide what filesystem to use. I got the NDFS demo working, and am excited to use it, but it looks pretty new. Should I consider using it for production? I'm considering storing quite a lot of data, in the 10-100

[Nutch-dev] Re: tools cleanup

2005-05-17 Thread Sami Siren
should we introduce a new package for these: NutchConfigurable, NutchConfigured and the upcoming action classes - org.apache.nutch.action ? -- Sami Siren Stefan Groschupf wrote: Hi, Doug, can you or someone else please commit the classes you suggested, I think most / all agree and we can start

[Nutch-dev] Re: Update: HTTPClient for protocol-http and protocol-https

2005-05-17 Thread Doug Cutting
Andrzej Bialecki wrote: You can download the patch from here: http://www.getopt.org/nutch/20050507.patch I have not yet had a chance to try this. Following are some quick comments from reading the patch. Overall I think this is great stuff. 1. Why does an HTMLMetaTags need to be passed to P