Re: HTTP error 400

2012-05-18 Thread Jean-François Gingras
Yes. Also take a look at this page [1] for script examples. [1] http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script On Thu, May 17, 2012 at 6:07 AM, Tolga wrote: > I'm still confused. You mean to use

Re: [VOTE] Apache Nutch 1.5 release rc #1

2012-05-18 Thread Mattmann, Chris A (388J)
Hey Guys, Sorry I've been on hiatus enjoying a trip with my family :) I was hoping to respin rc #2 before I left, but I didn't find the spare cycles. Lewis, basically if you look through the rc #1 thread there are about 3-4 comments from Julien, you, and I think from Sami. I have them written do

Re: [VOTE] Apache Nutch 1.5 release rc #1

2012-05-18 Thread Lewis John Mcgibbney
When the community is satisfied that we have a good release candidate and when the VOTE'ing suits the required conditions. Ultimately the timing for a release is down to the release manager but I think it is fair to say that we are on our way to getting 1.5 released soon as the (trunk) codebase is

RE: [VOTE] Apache Nutch 1.5 release rc #1

2012-05-18 Thread Markus Jelsma
As soon as the release manager finds some spare time to manage the release process. Please be patient or build from trunk which is the next 1.5. -Original message- > From:Matthias Paul > Sent: Fri 18-May-2012 15:09 > To: user@nutch.apache.org > Subject: Re: [VOTE] Apache Nutch 1.5 rel

RE: Exclude certain mime-types

2012-05-18 Thread Markus Jelsma
-Original message- > From:Matthias Paul > Sent: Fri 18-May-2012 14:57 > To: user@nutch.apache.org > Subject: Exclude certain mime-types > > How can I exclude certain mime-types from crawling, for example Word documents? > If I have parse-tika in plugin.includes it will parse them. Do

Re: [VOTE] Apache Nutch 1.5 release rc #1

2012-05-18 Thread Matthias Paul
When will Nutch 1.5 be released? Matthias On Wed, Apr 18, 2012 at 1:46 PM, Bharat Goyal wrote: > +1 > > > On Monday 16 April 2012 12:34 PM, Markus Jelsma wrote: >> >>  +1 >> >>  On Mon, 16 Apr 2012 05:43:22 +, "Mattmann, Chris A (388J)" >>    wrote: >>> >>> Hi Folks, >>> >>> A candidate for

Exclude certain mime-types

2012-05-18 Thread Matthias Paul
How can I exclude certain mime-types from crawling, for example Word documents? If I have parse-tika in plugin.includes it will parse them. Do I have to change parse-plugins.xml? I can't exclude them in regex-urlfilter as the .doc extension is not present in the URLs. Thanks Matthias

Re: ERROR solr.SolrIndexer - java.io.IOException: Job failed!

2012-05-18 Thread Jim Chandler
You need to add the site field to the schema.xml of your Solr installation. Jim On Fri, May 18, 2012 at 12:58 AM, cameron tran wrote: > Hello > > I am trying to get Nutch 1.4 (downloaded binary) to do solrindex to > http://127.0.0.1:8983/solr/ but am getting the following error. Using Solr > 3.6.0.. Pleas
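
For illustration only, a field declaration of the kind Jim describes might look like the fragment below. The attribute values (type, stored, indexed) are assumptions and are not taken from the thread; compare them with the schema.xml shipped in Nutch's conf directory before copying anything into Solr.

```xml
<!-- Hypothetical sketch: declares the "site" field inside the <fields> section of Solr's schema.xml. -->
<!-- The type/stored/indexed values are assumed here; verify them against the schema.xml that ships with Nutch. -->
<field name="site" type="string" stored="false" indexed="true"/>
```

Solr reads schema.xml at startup, so a restart (or core reload) is needed before running solrindex again.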

Re: Can't retrieve Tika parser for mime-type text/javascript

2012-05-18 Thread Lewis John Mcgibbney
One final point there which I forgot. The point of the parse-js plugin is to extract outlinks from JS pages. The page you supplied contained only one outlink, to a page which no longer exists, so depending on what your purposes are you may not find the parse-js plugin of much help. Lewis On Fri, May

Re: Can't retrieve Tika parser for mime-type text/javascript

2012-05-18 Thread Lewis John Mcgibbney
I tried configuring my instance to fetch and parse your page, with the following result: lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local/bin$ ./nutch parsechecker http://lucene.472066.n3.nabble.com/file/n3984604/dtree.js fetching: http://lucene.472066.n3.nabble.com/file/n3984604/dtree.js