Re: A Developer's getting started doc?

2006-05-02 Thread TDLN
Hi Andrew, you can either get one of the distributions, a nightly build, or check out directly from SVN to get the sources. Then I would suggest checking the targets in the ant build file; there are targets for compiling, cleaning and testing. Use 'ant tar' to make a release tarball that you

Re: Creating a throttle

2006-05-02 Thread TDLN
I think something like this has already been done (apart from the daily changes you suggest) http://issues.apache.org/jira/browse/NUTCH-207 Rgrds, Thomas On 5/1/06, Fankhauser, Alain [EMAIL PROTECTED] wrote: Hello I'm thinking about creating a throttle that lets us decide at

Re: Content-Type inconsistency?

2006-05-02 Thread Jérôme Charron
I'm not so sure. When crawling Apache we had trouble with this feature. Some HTML files that had an XML header, and that the server identified as text/html, Nutch decided to treat as XML, not HTML. Yes, the current version of the mime-type resolver is a crude one. XML, HTML, RSS and all XML-based

Re: mapred question

2006-05-02 Thread Doug Cutting
[EMAIL PROTECTED] wrote: As far as we understood from the MapRed documentation, all reduce tasks must be launched after the last map task is finished, i.e. map and reduce must not work simultaneously. But often in the logs we see records such as: map 80%, reduce 10% and many more records where map is less than

Re: Content-Type inconsistency?

2006-05-02 Thread Doug Cutting
Jérôme Charron wrote: We had to turn off the guessing of content types to index Apache correctly. Instead of turning off the guessing of content types you should only remove the magic for xml in mime-types.xml Perhaps that would have worked also, but, with Apache, simply trusting the

RE: A Developer's getting started doc?

2006-05-02 Thread Wootton, Alan
I also have .classpath and .project files for hadoop in Eclipse. Why are these not checked in? - alan -Original Message- From: TDLN [mailto:[EMAIL PROTECTED] Sent: Tuesday, May 02, 2006 1:33 AM To: nutch-dev@lucene.apache.org Subject: Re: A Developer's getting started doc? Hi Andrew,

[jira] Created: (NUTCH-260) Three new plugins that parse, index and query meta tags defined in the configuration

2006-05-02 Thread Jake Vanderdray (JIRA)
Three new plugins that parse, index and query meta tags defined in the configuration Key: NUTCH-260 URL: http://issues.apache.org/jira/browse/NUTCH-260 Project: Nutch Type: New

[jira] Updated: (NUTCH-260) Three new plugins that parse, index and query meta tags defined in the configuration

2006-05-02 Thread Jake Vanderdray (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-260?page=all ] Jake Vanderdray updated NUTCH-260: -- Attachment: nutch_customizations.tar The attachment is a tarball of the plugin source. Three new plugins that parse, index and query meta tags defined in

Re: A Developer's getting started doc?

2006-05-02 Thread Lukas Vlcek
Thomas, I would really appreciate your .classpath and .project files for Eclipse (for Nutch-trunk). Could you send them to me? Or could you upload them somewhere? I don't think I am a novice in terms of Eclipse but frankly I am too lazy to configure all these settings manually. I do use Maven all