I have a page which is mainly controlled by JavaScript and Ajax,
so I need to parse it.
Thanks a lot.
Sent from the Nutch - User mailing list archive
Hi ML, this is the configuration for the index-metatags plugin.
In your schema.xml (this file is the same in Solr and Nutch):
In nutch-site.xml you need to put something like this (note the property
name and value; the surrounding XML tags did not survive the archive):
metatags.names
keywords;description;last_modified
For plugin index-metatags: Indic
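Reconstructed as a nutch-site.xml fragment, the name/value pair above would look like this (a sketch; the value list is the one quoted in the mail, the description text is mine):

```xml
<!-- nutch-site.xml: tell the metatags plugins which meta tags to extract -->
<property>
  <name>metatags.names</name>
  <value>keywords;description;last_modified</value>
  <description>Names of the meta tags to extract, separated by ';'.</description>
</property>
```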
Just a quick remark: I recently had continuous problems setting that value
to -1, probably due to extremely large pages or loop issues, causing
timeouts. Setting the value to just 'very large' solved that.
hth
Piet
On Tue, May 15, 2012 at 4:43 PM, Julien Nioche
<lists.digitalpeb...@gmail.com> wrote:
Try setting http.content.limit to a very large value or -1. The parser
sometimes chokes on truncated content.
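For reference, the setting being discussed lives in nutch-site.xml; here is a sketch with a finite "very large" value, per Piet's remark (the 10 MB figure is an arbitrary choice; the shipped default is 65536 bytes):

```xml
<!-- nutch-site.xml: cap on downloaded content size; truncation past this
     limit is what makes the parser choke. -1 removes the limit entirely. -->
<property>
  <name>http.content.limit</name>
  <value>10485760</value> <!-- 10 MB; the shipped default is 65536 -->
</property>
```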
On 15 May 2012 15:17, LEVILLAIN Olivier wrote:
Hi,
Each time I try to include a Word file in my fetch/parse list, I always get
the following error:
2012-05-15 15:02:40,319 ERROR tika.TikaParser - Error parsing
http://mydomain/mydir/mydoc.doc
java.lang.IndexOutOfBoundsException: Unable to read 512 bytes from 103424 in
stream of length 62511
I'm going nuts.
I issued the command bin/nutch crawl urls -solr
http://localhost:8983/solr/ -depth 3 -topN 5, went on to
http://localhost:8983/solr/admin/stats.jsp and verified the index, but I
can't search within a web page. What am I doing wrong?
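One quick way to see whether anything was actually indexed is to query Solr directly rather than eyeballing the stats page. A minimal sketch, assuming the default single-core Solr on localhost:8983; the helper name and the `content` field are illustrative, not confirmed by the thread:

```shell
#!/bin/sh
# Build a Solr select URL for a given query string (hypothetical helper).
solr_query_url() {
    printf 'http://localhost:8983/solr/select?q=%s&wt=json' "$1"
}

# Only issue the request when a local Solr is actually reachable.
if curl -s -o /dev/null 'http://localhost:8983/solr/admin/ping' 2>/dev/null; then
    curl -s "$(solr_query_url 'content:nutch')"
fi
```

A non-zero `numFound` in the response means the crawl reached the index and the problem is on the query side.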
Regards,
bin/nutch solrindex http://localhost:8983/solr/ crawldb -linkdb
crawl/linkdb crawl/segments/*
SolrIndexer: starting at 2012-05-15 15:34:36
org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: file:/root/apache-nutch-1.4-bin/runtime/local/crawl/current
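For comparison, an invocation that matches the tutorial's directory layout would be along these lines — a sketch, with all paths assumed from the `bin/nutch crawl` output directory rather than verified against this installation:

```shell
#!/bin/sh
# Paths assumed from the tutorial's 'crawl' output directory (not verified here).
SOLR_URL=http://localhost:8983/solr/
CRAWLDB=crawl/crawldb
LINKDB=crawl/linkdb

# Only run when the crawl output actually exists.
if [ -d "$CRAWLDB" ]; then
    bin/nutch solrindex "$SOLR_URL" "$CRAWLDB" -linkdb "$LINKDB" crawl/segments/*
fi
```

The error above suggests the crawldb argument pointed somewhere without a `current` subdirectory; passing the actual crawldb path avoids that.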
On Tuesday 15 May 2012 17:39:31 Vikas Hazrati wrote:
So once the crawl (which abstracts iterative crawls till the depth is
reached) is finished, is there a way to trigger a recrawl as well as a part
of some command line option so that Nutch continues to run as a daemon or
is shell script the way out?
Regards | Vikas
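There is no daemon mode in the crawl command itself; the usual answer is a small shell script run in a loop or from cron. A rough sketch of one recrawl round, assuming the tutorial's crawl/ layout — the paths and the -topN value are illustrative:

```shell
#!/bin/sh
# Pick the newest segment directory (segments are named by timestamp,
# so a lexicographic sort works).
latest_segment() {
    ls -d "$1"/* 2>/dev/null | sort | tail -n 1
}

recrawl() {
    # One generate/fetch/parse/updatedb round; repeat under cron or a loop.
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    seg=$(latest_segment crawl/segments)
    bin/nutch fetch "$seg"
    bin/nutch parse "$seg"
    bin/nutch updatedb crawl/crawldb "$seg"
}

# Run a round only when a Nutch installation is actually present here.
if [ -x bin/nutch ]; then
    recrawl
fi
```

Pages become due for refetching once db.fetch.interval.default expires, so re-running this round periodically effectively gives a continuous recrawl.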
Hi,
I would like to report that the directory layout given in the command
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawldb -linkdb
crawldb/linkdb crawldb/segments/* in the Nutch FAQ doesn't match the
previous examples.
That said, I'm totally confused. How can I index to solr if I don't crawl?
Please follow the step-by-step tutorial, it's explained there:
http://wiki.apache.org/nutch/NutchTutorial
On Tuesday 15 May 2012 13:40:26 Tolga wrote:
> I'm a little confused. How can I not use the crawl command and execute
> the separate crawl cycle commands at the same time?
I see, it doesn't work. The JSParser is known not to work very well, if
at all. Why do you want to parse JS anyway? It's not a very common
practice.
On Monday 14 May 2012 01:35:01 forwardswing wrote:
> I modify the parse-plugins.xml clip from:
>
> to
yes
On Tuesday 15 May 2012 12:45:28 Taeseong Kim wrote:
> is whole web content download possible?
>
> include Flash, Image, CSS, JavaScript
I'm a little confused. How can I not use the crawl command and execute
the separate crawl cycle commands at the same time?
Regards,
On 5/11/12 9:40 AM, Markus Jelsma wrote:
Ah, that means: don't use the crawl command, and do a little shell
scripting to execute the separate crawl cycle commands.
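Spelled out, that "little shell scripting" amounts to something like this — a sketch of a depth-3 cycle following the NutchTutorial; all paths, the -topN value, and the Solr URL are assumptions:

```shell
#!/bin/sh
# Newest entry in a directory (segment names are timestamps, so sort works).
newest() { ls -d "$1"/* 2>/dev/null | sort | tail -n 1; }

crawl_cycle() {
    bin/nutch inject crawl/crawldb urls
    for round in 1 2 3; do
        bin/nutch generate crawl/crawldb crawl/segments -topN 5
        seg=$(newest crawl/segments)
        bin/nutch fetch "$seg"
        bin/nutch parse "$seg"
        bin/nutch updatedb crawl/crawldb "$seg"
    done
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb \
        -linkdb crawl/linkdb crawl/segments/*
}

# Only run against a real local Nutch installation.
if [ -x bin/nutch ]; then
    crawl_cycle
fi
```

Running the steps separately like this is what makes the daemon/recrawl scenarios in this thread possible, since each step can be repeated or scheduled on its own.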
hi
I tried to index a huge URL set with Nutch 1.4 on Hadoop 0.20. In the
reduce part it throws an error like the one below. I think some
character breaks the XML. Any idea how to resolve this?
May 15, 2012 10:37:59 AM org.apache.solr.update.processor.LogUpdateProcessor
finish
INFO: [collection1] webapp=/solr path
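If a control character really is breaking the XML sent to Solr, one blunt way to confirm or clean it at the shell level is to strip the characters XML 1.0 forbids (everything below 0x20 except tab, LF, CR). A sketch for diagnosis, not the fix Nutch itself would apply:

```shell
#!/bin/sh
# Strip control characters that are not legal in XML 1.0
# (keeps tab \011, newline \012, carriage return \015).
strip_bad_xml_chars() {
    tr -d '\000-\010\013\014\016-\037'
}

printf 'hello\001world\n' | strip_bad_xml_chars   # prints "helloworld"
```

The longer-term fix is to sanitize the field values before they reach the Solr update handler, but piping a suspect document through this filter quickly shows whether stray control characters are the culprit.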