date:20110607

Character encoding on Html-Pages

2011-06-07 Thread Alex F

Hi, the regex metaPattern inside org.apache.nutch.parse.html.HtmlParser is not suitable for sites using single quotes for meta http-equiv. Example: meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1' We experienced a couple of pages with that kind of quotes and
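
To illustrate the quoting problem described in this message, here is a minimal sketch of a meta-tag charset sniffer that accepts double-quoted, single-quoted and unquoted http-equiv values. It is not the Nutch source and not necessarily the fix applied there: the class name MetaCharsetSniffer, the method sniffCharset and both patterns are assumptions made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSniffer {

    // Hypothetical pattern (not copied from Nutch): the optional ["'] on both
    // sides of content-type lets single quotes, double quotes, or no quotes match.
    private static final Pattern META_PATTERN = Pattern.compile(
        "<meta\\s+([^>]*http-equiv\\s*=\\s*[\"']?content-type[\"']?[^>]*)>",
        Pattern.CASE_INSENSITIVE);

    // Captures the charset name that follows "charset=" inside the matched tag.
    private static final Pattern CHARSET_PATTERN = Pattern.compile(
        "charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)",
        Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in a meta http-equiv tag, or null if none is found. */
    public static String sniffCharset(String html) {
        Matcher meta = META_PATTERN.matcher(html);
        if (!meta.find()) {
            return null;
        }
        Matcher charset = CHARSET_PATTERN.matcher(meta.group(1));
        return charset.find() ? charset.group(1) : null;
    }

    public static void main(String[] args) {
        String single = "<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>";
        String dbl = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">";
        System.out.println(sniffCharset(single));
        System.out.println(sniffCharset(dbl));
    }
}
```

A pattern that only allows a double quote after http-equiv= would return null for the first input, which is the behaviour reported above.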

Re: Character encoding on Html-Pages

2011-06-07 Thread lewis john mcgibbney

Hi Alex, I cannot locate the java file you mention at org.apache.nutch.parse.html.HtmlParser in either 1.2 or branch 1.3... Having a quick look at org.apache.nutch.parse.HTMLMetaTags (in both versions above it is identical) it appears that you are right the double quotes for meta http-equiv

Re: Character encoding on Html-Pages

2011-06-07 Thread Markus Jelsma

Hi, It is a plugin found in src/plugins/parse-html/. Cheers On Tuesday 07 June 2011 18:01:22 lewis john mcgibbney wrote: Hi Alex, I cannot locate the java file you mention at org.apache.nutch.parse.html.HtmlParser in either 1.2 or branch 1.3... Having a quick look at

Re: keeping index up to date

2011-06-07 Thread alxsss

Hi, I took a look at the recrawl script and noticed that all the steps except URL injection are repeated at each subsequent indexing, and wondered why we would generate new segments. Is it possible to do the fetch and update steps for all previous $s1..$sn, and then the invertlinks and index steps? Thanks. Alex.

Re: keeping index up to date

2011-06-07 Thread Markus Jelsma

Hi, I took a look at the recrawl script and noticed that all the steps except URL injection are repeated at each subsequent indexing, and wondered why we would generate new segments. Is it possible to do the fetch and update steps for all previous $s1..$sn, and then the invertlinks and index steps? No, the

Re: Character encoding on Html-Pages

2011-06-07 Thread Markus Jelsma

Ticket: https://issues.apache.org/jira/browse/NUTCH-1006 Hi, the regex metaPattern inside org.apache.nutch.parse.html.HtmlParser is not suitable for sites using single quotes for meta http-equiv. Example: meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1' We

Re: keeping index up to date

2011-06-07 Thread lewis john mcgibbney

Hi, To add to Markus' comments, if you take a look at the script, it is written in such a way that if run in safe mode it protects us against an error which may occur. If this is the case we can recover segments etc. and take appropriate actions to resolve. On Tue, Jun 7, 2011 at 9:01 PM, Markus

Re: Nutch not crawling on a pre-existing hadoop cluster?

2011-06-07 Thread Julien Nioche

Hi Brian, Would be easier to simply generate a job file and the script in bin to run the tasks. Hardcopying the plugins + jars on each machine is not practical. The reason we separated the jars+plugins approach from the job in the runtimes for 1.3 was to avoid possible conflicts. Julien I

Re: Dump all urls from merged index

2011-06-07 Thread Julien Nioche

or you can modify the code from the crawldb reader and get it to dump only the keys. If your crawldb is large, regex will take forever On 7 June 2011 22:31, Markus Jelsma markus.jel...@openindex.io wrote: Well, you can dump the crawldb using the bin/nutch readdb command. You'd still need to
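
As a rough sketch of the dump-then-extract route mentioned in the quoted reply, the following filters the lines of a text dump down to the URL keys using a plain substring check rather than a regex. It assumes that each record in the dump starts with a line of the form URL, tab, "Version: n"; that layout, the class name CrawlDbDumpUrls and the method extractUrls are assumptions for illustration, so check them against your own dump output first. It does not implement the crawldb-reader modification suggested above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CrawlDbDumpUrls {

    // Assumed marker: a record's first line is "<url>\tVersion: <n>".
    private static final String RECORD_MARKER = "\tVersion: ";

    /** Returns the URL key of every line that looks like the start of a record. */
    public static List<String> extractUrls(Iterable<String> dumpLines) {
        List<String> urls = new ArrayList<>();
        for (String line : dumpLines) {
            int idx = line.indexOf(RECORD_MARKER);
            if (idx > 0) {
                // Everything before the tab is taken to be the URL key.
                urls.add(line.substring(0, idx));
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the assumed dump layout.
        List<String> sample = Arrays.asList(
            "http://example.com/\tVersion: 7",
            "Status: 2 (db_fetched)",
            "",
            "http://example.org/page\tVersion: 7",
            "Status: 1 (db_unfetched)");
        for (String url : extractUrls(sample)) {
            System.out.println(url);
        }
    }
}
```

For a large dump, the same loop can be fed line by line from a buffered reader instead of an in-memory list.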

Re: Custom seed source

2011-06-07 Thread Fyodor Yarochkin

On Wed, Jun 8, 2011 at 4:11 AM, Markus Jelsma markus.jel...@openindex.io wrote: Dear nutch developers, Nutch by default takes seeds from file system (-seedDir). Is it possible to change it to take seeds from mysql table? In theory, yes, but i would not recommend it. It would be quite a job

[RESULT] [VOTE] Apache Nutch 1.3 Release Candidate #3

2011-06-07 Thread Mattmann, Chris A (388J)

Hi Folks, This VOTE has passed with the following tallies: +1 Nutch PMC Chris Mattmann Markus Jelsma Julien Nioche Lewis John McGibbney I'll go ahead and push the release to the mirrors and release the Maven repo to Central and then send an ANNOUNCE. Thanks! Cheers, Chris


11 matches
