Re: Can't retrieve Tika parser for mime-type text/javascript

2012-05-15 Thread forwardswing
I have a page which is mainly controlled by JavaScript and AJAX, so I need to parse it. Thanks a lot. -- View this message in context: http://lucene.472066.n3.nabble.com/Can-t-retrieve-Tika-parser-for-mime-type-text-javascript-tp3983599p3984018.html Sent from the Nutch - User mailing list archive

RE: Indexing HTML metatags from Nutch into Solr

2012-05-15 Thread Ing. Eyeris Rodriguez Rueda
Hi ML, this is the configuration for the index-metatags plugin. In your schema.xml (this file is the same in Solr and Nutch). In nutch-site.xml you need to put something like this. Look at name and value: metatags.names keywords;description;last_modified For plugin index-metatags: Indic
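The actual XML was stripped by the list archive, so the following is only a sketch of what a typical index-metatags setup looks like. The property names (`metatags.names`, `index.parse.md`) and the semicolon-separated tag list come from the message and the plugin's conventions; the exact `plugin.includes` value and the Solr field definitions are assumptions and must match your own install:

```xml
<!-- nutch-site.xml: enable parse-metatags/index-metatags and list the tags -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metatags)|query-(basic|site|url)</value>
</property>
<property>
  <name>metatags.names</name>
  <value>keywords;description;last_modified</value>
</property>
<!-- expose the extracted metatags as indexable fields -->
<property>
  <name>index.parse.md</name>
  <value>metatag.keywords,metatag.description,metatag.last_modified</value>
</property>
```

And matching field declarations in schema.xml (field names assumed to follow the `metatag.*` convention):

```xml
<field name="metatag.keywords"    type="string" stored="true" indexed="true"/>
<field name="metatag.description" type="string" stored="true" indexed="true"/>
```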

Re: Tika parser exception IndexOutOfBoundsException

2012-05-15 Thread Piet van Remortel
Just a quick remark: I recently had continuous problems setting that value to -1, probably due to extremely large pages or loop issues causing timeouts. Setting the value to just 'very large' instead solved that. hth Piet On Tue, May 15, 2012 at 4:43 PM, Julien Nioche < lists.digitalpeb...@gmail.com>
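For reference, the property under discussion lives in nutch-site.xml. A minimal sketch; the 64 MB figure is just an illustrative "very large" value, not one taken from the thread:

```xml
<!-- nutch-site.xml -->
<property>
  <name>http.content.limit</name>
  <!-- -1 disables the limit entirely (Julien's suggestion);
       a large finite value (Piet's suggestion) avoids hangs on
       pathologically large pages -->
  <value>67108864</value>
</property>
```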

Re: Tika parser exception IndexOutOfBoundsException

2012-05-15 Thread Julien Nioche
Try setting http.content.limit to a very large value or -1. The parser sometimes chokes on truncated content. On 15 May 2012 15:17, LEVILLAIN Olivier wrote: > Hi, > > Each time I try to include a Word file in my fetch/parse list, I always > get the following error: > > 20

Tika parser exception IndexOutOfBoundsException

2012-05-15 Thread LEVILLAIN Olivier
Hi, Each time I try to include a Word file in my fetch/parse list, I always get the following error: 2012-05-15 15:02:40,319 ERROR tika.TikaParser - Error parsing http://mydomain/mydir/mydoc.doc java.lang.IndexOutOfBoundsException: Unable to read 512 bytes from 103424 in stream of length 62511

solrindex

2012-05-15 Thread Tolga
I'm going nuts. I issued the command bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5, went to http://localhost:8983/solr/admin/stats.jsp and verified the index, but I can't search for text within a web page. What am I doing wrong? Regards,

Re: HTTP error 400

2012-05-15 Thread Tolga
bin/nutch solrindex http://localhost:8983/solr/ crawldb -linkdb crawl/linkdb crawl/segments/* SolrIndexer: starting at 2012-05-15 15:34:36 org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/root/apache-nutch-1.4-bin/runtime/local/crawl/current On 5/15/12 2:05 PM

Re: Crawl-tool for iterative crawling?

2012-05-15 Thread Markus Jelsma
On Tuesday 15 May 2012 17:39:31 Vikas Hazrati wrote: > So once the crawl (which abstracts iterative crawls till the depth is > reached) is finished, is there a way to trigger a recrawl as well as a part > of some command line option so that Nutch continues to run as a daemon or > is shell script th

Re: Crawl-tool for iterative crawling?

2012-05-15 Thread Vikas Hazrati
So once the crawl (which abstracts iterative crawls till the depth is reached) is finished, is there a way to trigger a recrawl as part of some command-line option, so that Nutch continues to run as a daemon, or is a shell script the way out? Regards | Vikas On Fri, May 11, 2012 at 8:26 PM,

Re: HTTP error 400

2012-05-15 Thread Tolga
Hi, I would like to report that the directory scheme given in the command bin/nutch solrindex http://127.0.0.1:8983/solr/ crawldb -linkdb crawldb/linkdb crawldb/segments/* in the Nutch FAQ doesn't match previous examples. That said, I'm totally confused. How can I index to Solr if I don't crawl?

Re: HTTP error 400

2012-05-15 Thread Markus Jelsma
Please follow the step-by-step tutorial; it's explained there: http://wiki.apache.org/nutch/NutchTutorial On Tuesday 15 May 2012 13:40:26 Tolga wrote: > I'm a little confused. How can I not use the crawl command and execute > the separate crawl cycle commands at the same time? > > Regards, > > O

Re: Can't retrieve Tika parser for mime-type text/javascript

2012-05-15 Thread Markus Jelsma
I see, it doesn't work. The JSParser is known to not work very well, or to not work at all. Why do you want to parse JS anyway? It's not a very common practice. On Monday 14 May 2012 01:35:01 forwardswing wrote: > I modify the parse-plugins.xml clip from: > > > > > to
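The XML the poster modified was stripped by the archive. A hypothetical parse-plugins.xml mapping of the kind being discussed, following the stock file's `mimeType`/`plugin` conventions (the plugin also has to appear in `plugin.includes` to be loaded):

```xml
<!-- parse-plugins.xml: route text/javascript to the JS parser -->
<mimeType name="text/javascript">
  <plugin id="parse-js" />
</mimeType>
```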

Re: webpage download

2012-05-15 Thread Markus Jelsma
yes On Tuesday 15 May 2012 12:45:28 Taeseong Kim wrote: > is whole web content download possible? > > including Flash, images, CSS, JavaScript
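Worth noting: this works only if the URL filters don't exclude those file types. The stock conf/regex-urlfilter.txt ships with a suffix-exclusion rule along these lines (the exact pattern varies by Nutch version); removing the extensions you want keeps them in the crawl:

```
# default rule in regex-urlfilter.txt: skip URLs ending in these suffixes.
# Delete e.g. css, js, swf, gif, jpg, png from the list to fetch them.
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
```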

Re: HTTP error 400

2012-05-15 Thread Tolga
I'm a little confused. How can I not use the crawl command and execute the separate crawl cycle commands at the same time? Regards, On 5/11/12 9:40 AM, Markus Jelsma wrote: Ah, that means don't use the crawl command and do a little shell scripting to execute the separate crawl cycle commands, s
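For context, the "separate crawl cycle commands" Markus refers to are the individual steps that the all-in-one `crawl` command wraps. An outline in the Nutch 1.4 style (directory names and the segment variable are illustrative, per the NutchTutorial; a real script loops over generate/fetch/parse/updatedb per depth):

```
bin/nutch inject crawldb urls
bin/nutch generate crawldb crawldb/segments
# $SEGMENT = the newest directory under crawldb/segments
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb crawldb $SEGMENT
bin/nutch invertlinks linkdb -dir crawldb/segments
bin/nutch solrindex http://localhost:8983/solr/ crawldb -linkdb linkdb crawldb/segments/*
```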

nutch 1-4 with solr-4

2012-05-15 Thread ramires
hi, I tried to index a huge URL set with Nutch 1.4 on Hadoop 0.20. In the reduce part it throws an error like this; I think some character breaks the XML. Any idea how to resolve this? May 15, 2012 10:37:59 AM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: [collection1] webapp=/solr path