[Nutch-dev] problems running crawl tool

2005-04-26 Thread Chris Mattmann
Hi Folks, I've recently encountered the following error using the crawl tool: 050426 214400 fetching http://search.csmonitor.com/specials/neocon/index.html 050426 214401 fetching http://perspolis.usc.edu/Users/shahram/ 050426 214401 fetching http://www.cnn.com/rssclick/2005/TECH/science/

[Nutch-dev] Re: Where are the nutch experts?

2005-04-26 Thread Nutch开发邮件
Very good, I will try to do it! 2005/4/27, Andy Liu <[EMAIL PROTECTED]>: > > You can cut and paste this code into any indexing plugin, or create a new > one: > > // add links > Outlink[] outlinks = parse.getData().getOutlinks(); > int end = Math.min(outlinks.length, > UpdateDatabaseTool.MAX_OUTL

Re: [Nutch-dev] Re: parse-mp3 dependency missing

2005-04-26 Thread Hasan Diwan
I'll work on this. On 25/04/05, Doug Cutting <[EMAIL PROTECTED]> wrote: > Someone could re-write this plugin to use the Swing RTF parser that > comes with the JVM: -- Cheers, Hasan Diwan <[EMAIL PROTECTED]>

[Nutch-dev] Re: Where are the nutch experts?

2005-04-26 Thread Andy Liu
sorry, typo in last email: "all searches" = "allow searches" On 4/26/05, Andy Liu <[EMAIL PROTECTED]> wrote: > You can cut and paste this code into any indexing plugin, or create a new one: > >// add links > Outlink[] outlinks = parse.getData().getOutlinks(); > int end = Math.min(outl

[Nutch-dev] Re: Where are the nutch experts?

2005-04-26 Thread Andy Liu
You can cut and paste this code into any indexing plugin, or create a new one: // add links Outlink[] outlinks = parse.getData().getOutlinks(); int end = Math.min(outlinks.length, UpdateDatabaseTool.MAX_OUTLINKS_PER_PAGE); for (int i = 0; i < end; i++) { Outlink link = outli

[Nutch-dev] Where are the nutch experts?

2005-04-26 Thread Marco PV
Hi, Could any Nutch expert here help me find a way to use the GetLinks class? It seems that my messages are being ignored... Isn't it a great feature to be able to search for "link:www.xxx.com"? Or even to be able to show where an image comes from when searching image files? Hope s

[Nutch-dev] Re: Error at building nutch with ant.

2005-04-26 Thread Doug Cutting
Jakob Heidebrecht wrote: i get this error when i try to build nutch with ant. What version of Java are you using? What version of Nutch are you compiling? On what platform? Doug

[Nutch-dev] Error at building nutch with ant.

2005-04-26 Thread Jakob Heidebrecht
Hi, I get this error when I try to build Nutch with ant. Does somebody know what it is? Regards Jakob compile-core: [javac] Compiling 2 source files to /data/nutch/trunk/build/classes [javac] Found 1 semantic error compiling "/data/nutch/trunk/src/java/org/apache/nutch/ipc/Client.java"

[Nutch-dev] ezmlm warning

2005-04-26 Thread nutch-dev-help
Hi! This is the ezmlm program. I'm managing the nutch-dev@incubator.apache.org mailing list. I'm working for my owner, who can be reached at [EMAIL PROTECTED] Messages to you from the nutch-dev mailing list seem to have been bouncing. I've attached a copy of the first bounce message I received.

[Nutch-dev] [jira] Commented: (NUTCH-51) Removing a plugin after fetch but before indexing causes errors

2005-04-26 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-51?page=comments#action_63670 ] Doug Cutting commented on NUTCH-51: --- You need index-basic. > Removing a plugin after fetch but before indexing causes errors > --

[Nutch-dev] [jira] Created: (NUTCH-52) Parser plugin for MS Excel files

2005-04-26 Thread Rohit Kulkarni (JIRA)
Parser plugin for MS Excel files Key: NUTCH-52 URL: http://issues.apache.org/jira/browse/NUTCH-52 Project: Nutch Type: Improvement Components: fetcher Reporter: Rohit Kulkarni Priority: Trivial Attachments: parse-msexcel.

Re: [Nutch-dev] Getting HTML source

2005-04-26 Thread Piotr Kosiorowski
Hello, Page object does not contain html page content. To access fetched page content you have to iterate over segment data and extract it from there. Please have a look at SegmentReader class - it gives you a simple API to access all segment data. Regards Piotr Hasan Diwan wrote: On 23/04/05, r

[Nutch-dev] To get Nutch to print debug messages

2005-04-26 Thread rajat swarup
Hi, I'm trying to get some debug messages to be printed on the screen while the crawl is being done in Nutch. I just can't get it done. Just System.out.println() wouldn't work! The funny part is that I'm unable to create files using simple syntax like this inside the Fetcher.java class. bos = new

[Nutch-dev] [PATCH] - Datanode command line handling

2005-04-26 Thread Piotr Kosiorowski
Hello, I am attaching a minor patch for datanode command line handling that allows one to pass the name of the data directory as a command line parameter. If it is not passed, the data directory configured in the Nutch config file is used. It is very useful for running multiple instances of datanode on the same host -
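The fallback described in this message (command line argument first, configured value otherwise) can be sketched roughly as follows. This is not the attached patch: the class name `DataDirResolver`, the method `resolveDataDir`, and the default path are hypothetical, and the configured directory is passed in as a plain string rather than read from the Nutch config file.

```java
// Hypothetical sketch of "command line overrides config"; not code from the patch.
class DataDirResolver {

    // Returns the first command line argument if one was given and is non-blank,
    // otherwise the directory taken from configuration.
    static String resolveDataDir(String[] args, String configuredDir) {
        if (args != null && args.length > 0 && !args[0].trim().isEmpty()) {
            return args[0].trim();
        }
        return configuredDir;
    }

    public static void main(String[] args) {
        // "/tmp/nutch/data" is a placeholder standing in for the configured value.
        String dataDir = resolveDataDir(args, "/tmp/nutch/data");
        System.out.println("Using data directory: " + dataDir);
    }
}
```

Starting two instances with different arguments would then give each its own data directory, which matches the multiple-instances-per-host use case the message mentions.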

Re: [Nutch-dev] Re: parse-mp3 dependency missing

2005-04-26 Thread Doug Cutting
Hasan Diwan wrote: On 22/04/05, Doug Cutting <[EMAIL PROTECTED]> wrote: http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/parse-rtf/lib/ Is there a licensing issue in importing this into subversion? Yes. Look inside that jar. It is LGPL. http://wiki.apache.org/jakarta/Using_LGPL'd_cod

[Nutch-dev] Filter at fetch list time

2005-04-26 Thread David Wallace
Hi all. Has anyone written a version of the FetchListTool that only adds a URL to the fetch list if it complies with a particular Regex URL filter? If so, would they be prepared to share? I need to do something like this, but I dislike re-inventing wheels. Essentially, I'm doing an intranet-ty
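One way to picture the filtering step asked about here is a small standalone filter applied to candidate URLs before they are added to a fetch list. The sketch below is an assumption-laden illustration, not a modified FetchListTool and not Nutch's own URL filter plugin: the class `FetchListRegexFilter`, its rule syntax (`+regex` to accept, `-regex` to reject, first match wins, no match means reject), and the example host are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: keeps only URLs accepted by regex rules.
class FetchListRegexFilter {

    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Each rule line is "+regex" (accept) or "-regex" (reject); empty lines are skipped.
    FetchListRegexFilter(List<String> ruleLines) {
        for (String line : ruleLines) {
            if (line.isEmpty()) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    // The first rule whose pattern is found in the URL decides; no match means reject.
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    // Returns the candidates that pass the rules, preserving their order.
    List<String> filter(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String url : candidates) {
            if (accepts(url)) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

For an intranet-style crawl, the rules might reject image suffixes first and then accept a single host, for example `-\.(gif|jpg)$` followed by `+^http://intranet\.example\.com/`. Rejecting by default keeps off-site URLs out of the list unless a rule explicitly admits them.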

[Nutch-dev] [jira] Commented: (NUTCH-51) Removing a plugin after fetch but before indexing causes errors

2005-04-26 Thread byron miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-51?page=comments#action_63657 ] byron miller commented on NUTCH-51: --- I have index-more set up. protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-more|query-(basic|site|url)|clustering-carrot2|ontolo

[Nutch-dev] [jira] Updated: (NUTCH-53) Parser plugin for Zip files

2005-04-26 Thread Rohit Kulkarni (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-53?page=all ] Rohit Kulkarni updated NUTCH-53: Attachment: parse-zip.zip The plugin is tested with the latest nutch SVN and seems to work fine. Currently handles and calls parsers for the following types of f