Character encoding on Html-Pages

2011-06-07 Thread Alex F
Hi, the regex metaPattern inside org.apache.nutch.parse.html.HtmlParser is not suitable for sites that use single quotes in meta http-equiv. Example: <meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'> We experienced a couple of pages with that kind of quotes and
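
A minimal sketch of quote-tolerant charset sniffing for the case reported in this thread. It is not the metaPattern that ships in Nutch: the class name MetaCharsetSniffer, the method sniff, and both regular expressions are hypothetical and only illustrate accepting single quotes, double quotes, or no quotes around the attribute values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
class MetaCharsetSniffer {

    // Matches a <meta ... http-equiv=Content-Type ...> tag whose attribute
    // value is wrapped in single quotes, double quotes, or nothing at all.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+[^>]*http-equiv\\s*=\\s*[\"']?content-type[\"']?[^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Captures the charset name that follows "charset=" inside the tag.
    private static final Pattern CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)",
        Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, or null if no matching meta tag is found.
    static String sniff(String html) {
        Matcher meta = META.matcher(html);
        if (!meta.find()) {
            return null;
        }
        Matcher charset = CHARSET.matcher(meta.group());
        return charset.find() ? charset.group(1) : null;
    }

    public static void main(String[] args) {
        String single =
            "<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>";
        String dbl =
            "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">";
        System.out.println(sniff(single));
        System.out.println(sniff(dbl));
    }
}
```

The lookup is split into two steps (find the tag, then find the charset inside it) so that the quoting style of http-equiv and of content can differ without complicating a single expression.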

Re: Character encoding on Html-Pages

2011-06-07 Thread lewis john mcgibbney
Hi Alex, I cannot locate the Java file you mention at org.apache.nutch.parse.html.HtmlParser in either 1.2 or branch 1.3... Having a quick look at org.apache.nutch.parse.HTMLMetaTags (identical in both versions above), it appears that you are right: the double quotes for meta http-equiv

Re: Character encoding on Html-Pages

2011-06-07 Thread Markus Jelsma
Hi, It is a plugin found in src/plugins/parse-html/. Cheers On Tuesday 07 June 2011 18:01:22 lewis john mcgibbney wrote: Hi Alex, I cannot locate the java file you mention at org.apache.nutch.parse.html.HtmlParser in either 1.2 or branch 1.3... Having a quick look at

Re: keeping index up to date

2011-06-07 Thread alxsss
Hi, I took a look at the recrawl script and noticed that all the steps except URL injection are repeated on each subsequent indexing run, and I wondered why we would generate new segments. Is it possible to do the fetch and update steps for all previous segments $s1..$sn, then the invertlink and index steps? Thanks. Alex.

Re: keeping index up to date

2011-06-07 Thread Markus Jelsma
Hi, I took a look at the recrawl script and noticed that all the steps except URL injection are repeated on each subsequent indexing run, and I wondered why we would generate new segments. Is it possible to do the fetch and update steps for all previous segments $s1..$sn, then the invertlink and index steps? No, the

Re: Character encoding on Html-Pages

2011-06-07 Thread Markus Jelsma
Ticket: https://issues.apache.org/jira/browse/NUTCH-1006 Hi, the regex metaPattern inside org.apache.nutch.parse.html.HtmlParser is not suitable for sites using single quotes for meta http-equiv Example: meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1' We

Re: keeping index up to date

2011-06-07 Thread lewis john mcgibbney
Hi, To add to Markus' comments, if you take a look at the script, it is written in such a way that, if run in safe mode, it protects us against an error which may occur. If this is the case, we can recover segments etc. and take appropriate actions to resolve. On Tue, Jun 7, 2011 at 9:01 PM, Markus

Re: Nutch not crawling on a pre-existing hadoop cluster?

2011-06-07 Thread Julien Nioche
Hi Brian, It would be easier to simply generate a job file and use the script in bin to run the tasks. Hard-copying the plugins + jars onto each machine is not practical. The reason we separated the jars+plugins approach from the job in the runtimes for 1.3 was to avoid possible conflicts. Julien I

Re: Dump all urls from merged index

2011-06-07 Thread Julien Nioche
or you can modify the code from the crawldb reader and get it to dump only the keys. If your crawldb is large, regex will take forever On 7 June 2011 22:31, Markus Jelsma markus.jel...@openindex.io wrote: Well, you can dump the crawldb using the bin/nutch readdb command. You'd still need to

Re: Custom seed source

2011-06-07 Thread Fyodor Yarochkin
On Wed, Jun 8, 2011 at 4:11 AM, Markus Jelsma markus.jel...@openindex.io wrote: Dear nutch developers, Nutch by default takes seeds from file system (-seedDir). Is it possible to change it to take seeds from mysql table? In theory, yes, but i would not recommend it. It would be quite a job

[RESULT] [VOTE] Apache Nutch 1.3 Release Candidate #3

2011-06-07 Thread Mattmann, Chris A (388J)
Hi Folks, This VOTE has passed with the following tallies: +1 Nutch PMC: Chris Mattmann, Markus Jelsma, Julien Nioche, Lewis John McGibbney. I'll go ahead and push the release to the mirrors and release the Maven repo to Central and then send an ANNOUNCE. Thanks! Cheers, Chris