[Nutch-dev] RE: Huge Problem trying to develop plugin for Nutch

2005-03-25 Thread Chris Mattmann
Hi, For whatever reason (maybe file filtering) I think that my test2.java file that I attached didn’t go through. So, I renamed the extension to .txt. Let’s see if it goes through this time. Sorry about having to send another email. Thanks very much for any help! Cheers, Chr

[Nutch-dev] Huge Problem trying to develop plugin for Nutch

2005-03-25 Thread Chris Mattmann
Hi Folks, My name is Chris Mattmann; I work at the Jet Propulsion Laboratory in Pasadena, CA, U.S.A. I'm new to the list. Nice to meet you all. I am having some *major* trouble trying to build an RSS content parser plugin for Nutch. My plugin is based on the parse-pdf plugin stru

[Nutch-dev] Re: Mime/Magic mapper

2005-03-25 Thread John X
On Sat, Mar 26, 2005 at 01:48:05AM +0100, Jérôme Charron wrote: > Does somebody know why John Xing deactivated the mime.magic.file > support in protocol-file plugin? The "disabled" are only hooks to use mimetype/magic mapper. The mapper I used in a project had license issue (can't be redistributed)

[Nutch-dev] Mime/Magic mapper

2005-03-25 Thread Jérôme Charron
Does somebody know why John Xing deactivated the mime.magic.file support in the protocol-file plugin? I'm writing an mbox-parser plugin, and typically an mbox has no extension, so its MIME type cannot be determined using the extension/mime-type mapper. For an mbox, the mime-type can only be defined by "

Re: [Nutch-dev] Re: Licenses

2005-03-25 Thread Doug Cutting
Andy Hedges wrote: So I've modified the ant file and it works in the following way. If the file hasn't been downloaded, it does so. Even if the source has been downloaded, it may not have been compiled and jarred, so that is checked and, if necessary, it is built. It then removes all the intermediate file

[Nutch-dev] Re: jobtracker

2005-03-25 Thread Doug Cutting
Stefan Groschupf wrote: However I notice in JobConf line 73 String defaultValue = "nutch.jar"; ... get("mapred.jar", defaultValue); May mapred.jar need to be setted somewhere, Grep.java doesn't set it and it is not in the nutch-default.xml Have you run 'ant' recently? This creates a symlink nam

[Nutch-dev] Re: jobtracker

2005-03-25 Thread Stefan Groschupf
Did you run the Grep.main using bin/nutch? That should do the trick: Yes, I used bin/nutch see stack attached. In and out are directories of files. Re is the regex. Group is the optional group within the regex to select when mapping. Note that you should also define "mapred.job.tracker" in nu

[Nutch-dev] Re: jobtracker

2005-03-25 Thread Doug Cutting
Stefan Groschupf wrote: I was trying the Grep job, however it fails since nutch.jar was not found. Did you run the Grep.main using bin/nutch? That should do the trick: bin/nutch org.apache.nutch.mapReduce.demo.Grep[] In and out are directories of files. Re is the regex. Group is the optio

[Nutch-dev] Re: jobtracker

2005-03-25 Thread Stefan Groschupf
This is not a fatal error. It just means that the web server that permits monitoring has failed to start. MapReduce will still work fine, and you can monitor jobs using the JobClient API. I see, thanks. I was trying the Grep job, however it fails since nutch.jar was not found. 050325 2001

Re: [Nutch-dev] Re: Needing more protocols

2005-03-25 Thread PA
On Mar 25, 2005, at 17:59, Doug Cutting wrote: 1. Why should we replace it? java.net.URL is connection oriented. What is the problem with java.net.URL? java.net.URI is meant to handle parsing. Does it reject unknown protocols? Yes. Another issue is that something as innocent looking as equals() c
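The distinction drawn in this message can be checked with a small sketch. It assumes a stock JDK with no extra URLStreamHandler registered for the imap scheme; the class name, method names and the mail.example.org address are hypothetical, chosen only for illustration. Under that assumption, java.net.URL refuses the string, while java.net.URI parses it into scheme, host, port and path without needing any handler.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class UrlVersusUri {

    /** Returns true if java.net.URL accepts the string, i.e. a stream handler exists for its scheme. */
    static boolean urlAccepts(String spec) {
        try {
            new URL(spec); // throws for schemes without a registered handler
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    /** Splits a hierarchical URI into scheme, host, port and path using java.net.URI only. */
    static String[] split(String spec) throws URISyntaxException {
        URI uri = new URI(spec);
        return new String[] {
            uri.getScheme(),
            uri.getHost(),
            String.valueOf(uri.getPort()),
            uri.getPath()
        };
    }

    public static void main(String[] args) throws URISyntaxException {
        // Hypothetical IMAP address, used only as sample input.
        String imap = "imap://mail.example.org:143/INBOX";
        System.out.println("URL accepts imap: " + urlAccepts(imap));
        System.out.println(String.join(" | ", split(imap)));
    }
}
```

If a protocol plugin only needs the host, port and path, parsing with URI in this way avoids the dependency on a stream handler; whether that fits the Fetcher code discussed in the thread is not something this sketch can confirm.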

[Nutch-dev] Re: Needing more protocols

2005-03-25 Thread Jérôme Charron
> > Any ideas for building protocol plugins not using the java.net.URL ? All protocols could be bound to a URL scheme, no? For instance, for IMAP you can refer to the IMAP URL Scheme (http://www.isi.edu/in-notes/rfc2192.txt) > 1. Why should we replace it? What is the problem with java.net.URL? > Does

[Nutch-dev] RE: International Parser

2005-03-25 Thread Britz, Thibaut
Hi, You could use sun.text.Normalizer. (see http://www.rgagnon.com/javadetails/java-0456.html). Maybe you should also check first what language the text is written in, before applying the filter. Thibaut -Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Thu 3/24/2005
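As a rough sketch of the accent folding suggested here: the example below uses java.text.Normalizer, the public class available since Java 6, in place of the internal sun.text.Normalizer named in the message. That substitution is an assumption, not what the poster used. The class and method names are hypothetical. The idea is to decompose each accented character into a base letter plus combining marks (NFD), then drop the marks.

```java
import java.text.Normalizer;

public class AccentFolding {

    /** Removes combining marks after canonical decomposition, so "é" and "è" both become "e". */
    static String stripAccents(String text) {
        // NFD splits an accented character into its base letter and combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches any Unicode mark, including the combining accents produced above.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // \u00e9 is "é" and \u00e8 is "è"; escapes keep the source file ASCII-only.
        System.out.println(stripAccents("caf\u00e9 cr\u00e8me")); // prints "cafe creme"
    }
}
```

This folds accents for every language alike; checking the language first, as the message suggests, would have to be a separate step.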

Re: [Nutch-dev] Re: Licenses

2005-03-25 Thread Andy Hedges
Doug Cutting wrote: > Andy Hedges wrote: > >> As you can see for rtf I need to do more than just download the jar >> file as there is not a precompiled version on the net. I have to >> download the source and then run a number of steps to get to the >> resulting jar. I'm not sure that the plug

[Nutch-dev] Re: Needing more protocols

2005-03-25 Thread Doug Cutting
Konstantin Ott wrote: The protocol plugins seem to be the right starting point. But here and at other places like the Fetcher I see that pages are basically needing the java.net.URL. Actually only for splitting the url in host,port, path So we only need the URLStreamHandler in the protocol p

[Nutch-dev] Needing more protocols

2005-03-25 Thread Konstantin Ott
Hello, having looked a little at Nutch, it seems great, and we would like to use/extend it for something like a personal/corporate knowledge management tool. For that, it is necessary to index other content too. In particular, we need the content of IMAP folders and maybe some database content. The protocol

[Nutch-dev] Re: jobtracker

2005-03-25 Thread Doug Cutting
Stefan Groschupf wrote: Namenode runs without problems but the tracker throws an exception: Exception in thread "main" java.io.IOException: Could not start HTTP server at org.apache.nutch.mapReduce.JobTrackerInfoServer.start(JobTrackerInfoServer.java:104) It looks like that a kind of

[Nutch-dev] jobtracker

2005-03-25 Thread Stefan Groschupf
Hi developers, what is the trick to running a job tracker? I noticed that a set of Jetty libraries is missing in the Apache Subversion repository; is there any reason for that? It looks like it is not a license problem. Anyway, I started a namenode and a jobtracker after adding some jars. Namenode runs without problems

[Nutch-dev] Google and 302 redirect problem

2005-03-25 Thread Massimo Miccoli
Millions of Pages Google Hijacked using ODP Feed... http://slashdot.org/article.pl?sid=05/03/23/1446237&from=rss

Re: [Nutch-dev] Re: International Parser

2005-03-25 Thread cn
This simply would be a great thing... According to Doug Cutting <[EMAIL PROTECTED]>: > Would it be a problem to simply make this conversion for all languages? > Does Google distinguish between "é", "è" and "e" for other languages? Th

Re: [Nutch-dev] Re: Licenses

2005-03-25 Thread Doug Cutting
Andy Hedges wrote: As you can see for rtf I need to do more than just download the jar file as there is not a precompiled version on the net. I have to download the source and then run a number of steps to get to the resulting jar. I'm not sure that the plugin's build.xml is the place to be doi

Re: [Nutch-dev] Re: Licenses

2005-03-25 Thread Andy Hedges
Doug, As you can see for rtf I need to do more than just download the jar file as there is not a precompiled version on the net. I have to download the source and then run a number of steps to get to the resulting jar. I'm not sure that the plugin's build.xml is the place to be doing this as i