Hi,
The regex metaPattern in org.apache.nutch.parse.html.HtmlParser does not
handle sites that use single quotes for meta http-equiv.
Example: meta http-equiv='Content-Type' content='text/html;
charset=iso-8859-1'
We experienced a couple of pages with that kind of quotes and
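To illustrate the quoting problem described above, here is a minimal sketch of a meta-tag pattern that tolerates double quotes, single quotes, or no quotes around the http-equiv value. The class name MetaCharsetSniffer, the method sniffCharset, and both patterns are hypothetical and written for illustration; they are not copied from the Nutch source, and the real metaPattern may differ.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: class, method, and pattern names are hypothetical
// and are not taken from the Nutch code base.
class MetaCharsetSniffer {

    // Matches a meta tag whose http-equiv value is Content-Type, written with
    // double quotes, single quotes, or no quotes. Group 1 holds the attributes.
    private static final Pattern META_PATTERN = Pattern.compile(
        "<meta\\s+([^>]*http-equiv\\s*=\\s*[\"']?content-type[\"']?[^>]*)>",
        Pattern.CASE_INSENSITIVE);

    // Pulls the charset token out of the attributes captured above.
    private static final Pattern CHARSET_PATTERN = Pattern.compile(
        "charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)",
        Pattern.CASE_INSENSITIVE);

    // Returns the charset declared in the first matching meta tag, if any.
    static Optional<String> sniffCharset(String html) {
        Matcher meta = META_PATTERN.matcher(html);
        if (meta.find()) {
            Matcher charset = CHARSET_PATTERN.matcher(meta.group(1));
            if (charset.find()) {
                return Optional.of(charset.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String single =
            "<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>";
        String dbl =
            "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">";
        System.out.println(sniffCharset(single).orElse("not found"));
        System.out.println(sniffCharset(dbl).orElse("not found"));
    }
}
```

The only change that matters for the reported case is the character class that accepts either quote style around content-type; everything else is scaffolding to make the sketch runnable.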
Hi Alex,
I cannot locate the java file you mention at
org.apache.nutch.parse.html.HtmlParser in either 1.2 or branch 1.3...
Having a quick look at org.apache.nutch.parse.HTMLMetaTags (it is identical in
both versions above), it appears that you are right: the double quotes for
meta http-equiv
Hi,
It is a plugin found in src/plugins/parse-html/.
Cheers
On Tuesday 07 June 2011 18:01:22 lewis john mcgibbney wrote:
Hi Alex,
I cannot locate the java file you mention at
org.apache.nutch.parse.html.HtmlParser in either 1.2 or branch 1.3...
Having a quick look at
Hi,
I took a look at the recrawl script and noticed that all the steps except URL
injection are repeated at each subsequent indexing, and wondered why we would
generate new segments.
Is it possible to do only the fetch and update steps for all previous
$s1..$sn, followed by the invertlink and index steps?
Thanks.
Alex.
Hi,
I took a look at the recrawl script and noticed that all the steps except
URL injection are repeated at each subsequent indexing, and wondered why we
would generate new segments. Is it possible to do only the fetch and update
steps for all previous $s1..$sn, followed by the invertlink and index steps?
No, the
Ticket:
https://issues.apache.org/jira/browse/NUTCH-1006
Hi,
The regex metaPattern in org.apache.nutch.parse.html.HtmlParser does not
handle sites that use single quotes for meta http-equiv.
Example: meta http-equiv='Content-Type' content='text/html;
charset=iso-8859-1'
We
Hi,
To add to Markus' comments: if you take a look at the script, it is written
in such a way that, if run in safe mode, it protects us against an error which
may occur. If that happens, we can recover segments etc. and take
appropriate action to resolve it.
On Tue, Jun 7, 2011 at 9:01 PM, Markus
Hi Brian,
It would be easier to simply generate a job file and use the script in bin to
run the tasks. Copying the plugins + jars by hand to each machine is not practical.
The reason we separated the jars+plugins approach from the job in the
runtimes for 1.3 was to avoid possible conflicts.
Julien
I
Or you can modify the code from the crawldb reader and get it to dump only
the keys. If your crawldb is large, a regex will take forever.
On 7 June 2011 22:31, Markus Jelsma markus.jel...@openindex.io wrote:
Well, you can dump the crawldb using the bin/nutch readdb command. You'd
still need to
On Wed, Jun 8, 2011 at 4:11 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
Dear nutch developers,
Nutch by default takes seeds from the file system (-seedDir). Is it possible to
change it to take seeds from a MySQL table?
In theory, yes, but I would not recommend it. It would be quite a job
Hi Folks,
This VOTE has passed with the following tallies:
+1 Nutch PMC
Chris Mattmann
Markus Jelsma
Julien Nioche
Lewis John McGibbney
I'll go ahead and push the release to the mirrors and release the Maven repo to
Central and then send an ANNOUNCE.
Thanks!
Cheers,
Chris