Re: Can't retrieve Tika parser for mime-type text/javascript

2012-05-18 Thread Lewis John Mcgibbney
I tried configuring my instance to fetch and parse your page, with the following result:

lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local/bin$ ./nutch parsechecker http://lucene.472066.n3.nabble.com/file/n3984604/dtree.js
fetching: http://lucene.472066.n3.nabble.com/file/n3984604/dtree.js

Re: Can't retrieve Tika parser for mime-type text/javascript

2012-05-18 Thread Lewis John Mcgibbney
One final point which I forgot. The purpose of the parse-js plugin is to extract outlinks from JS pages. The page you supplied contains only one outlink, to a page which no longer exists, so depending on your purposes you may not find the parse-js plugin of much help.

Lewis

On Fri,

Re: ERROR solr.SolrIndexer - java.io.IOException: Job failed!

2012-05-18 Thread Jim Chandler
You need to add the site field to the schema.xml of your Solr instance.

Jim

On Fri, May 18, 2012 at 12:58 AM, cameron tran cameront...@gmail.com wrote:
Hello, I am trying to get Nutch 1.4 (downloaded binary) to do solrindex to http://127.0.0.1:8983/solr/ but am getting the following error. Using
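
As a rough illustration of the schema.xml change suggested above, the fragment below sketches how a site field might be declared inside the <fields> section of the Solr schema. The attribute values (type, stored, indexed) are assumptions modelled on the schema.xml that ships in the Nutch conf directory; check them against the copy bundled with your Nutch release, and restart Solr after editing.

```xml
<!-- Sketch only: goes inside the existing <fields> element of Solr's schema.xml. -->
<!-- The type/stored/indexed values are assumptions; compare with Nutch's own conf/schema.xml. -->
<field name="site" type="string" stored="false" indexed="true"/>
```

Copying the whole schema.xml from the Nutch conf directory into the Solr conf directory is usually simpler than adding fields one at a time, since solrindex may report further missing fields afterwards.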

Exclude certain mime-types

2012-05-18 Thread Matthias Paul
How can I exclude certain mime-types from crawling, for example Word documents? If I have parse-tika in plugin.includes it will parse them. Do I have to change parse-plugins.xml? I can't exclude them in regex-urlfilter, as the .doc extension is not present in the URLs.

Thanks
Matthias

Re: [VOTE] Apache Nutch 1.5 release rc #1

2012-05-18 Thread Matthias Paul
When will Nutch 1.5 be released?

Matthias

On Wed, Apr 18, 2012 at 1:46 PM, Bharat Goyal bharat.go...@shiksha.com wrote:
+1
On Monday 16 April 2012 12:34 PM, Markus Jelsma wrote:
+1
On Mon, 16 Apr 2012 05:43:22 +, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:
Hi

RE: Exclude certain mime-types

2012-05-18 Thread Markus Jelsma
-Original message-
From: Matthias Paul magethle.nu...@gmail.com
Sent: Fri 18-May-2012 14:57
To: user@nutch.apache.org
Subject: Exclude certain mime-types

How can I exclude certain mime-types from crawling, for example Word documents? If I have parse-tika in plugin.includes it

RE: [VOTE] Apache Nutch 1.5 release rc #1

2012-05-18 Thread Markus Jelsma
As soon as the release manager finds some spare time to manage the release process. Please be patient, or build from trunk, which is what will become 1.5.

-Original message-
From: Matthias Paul magethle.nu...@gmail.com
Sent: Fri 18-May-2012 15:09
To: user@nutch.apache.org
Subject: Re: [VOTE]

Re: [VOTE] Apache Nutch 1.5 release rc #1

2012-05-18 Thread Lewis John Mcgibbney
When the community is satisfied that we have a good release candidate and when the VOTE meets the required conditions. Ultimately the timing of a release is down to the release manager, but I think it is fair to say that we are on our way to getting 1.5 released soon as the (trunk) codebase

Re: [VOTE] Apache Nutch 1.5 release rc #1

2012-05-18 Thread Mattmann, Chris A (388J)
Hey Guys, Sorry I've been on hiatus enjoying a trip with my family :) I was hoping to respin rc #2 before I left, but I didn't find the spare cycles. Lewis, basically if you look through the rc #1 thread there are about 3-4 comments from Julien, you, and I think from Sami. I have them written

Re: HTTP error 400

2012-05-18 Thread Jean-François Gingras
Yes. Also take a look at this page [1] for script examples.

[1] http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script

On Thu, May 17, 2012 at 6:07 AM, Tolga to...@ozses.net wrote:
I'm still confused. You