Re: AW: extract links problem with parse-html plugin

2006-02-20 Thread Elwin
I will try it. Many thanks. 2006/2/20, Andrzej Bialecki <[EMAIL PROTECTED]>: > > Elwin wrote: > > No, I didn't try that. I just use the default parser for the plugin. > It > > seems that it works well now. > > Thx. > > > > I often find TagSoup performing better than NekoHTML. In case of some >

Re: AW: extract links problem with parse-html plugin

2006-02-20 Thread Elwin
But I also found a problem. Some links extracted from a page may contain internal spaces, like "http://www.domain.com/sub/dynamic.0001.html". I guess this is caused by the style file of the page. The link can be extracted, but it is in fact a wrong link, which can't be followed further. 2
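The whitespace problem described in this message can be sketched as follows. This is a minimal, hedged illustration in plain Java and not what the parse-html plugin itself does: the class LinkCleaner and its two methods are hypothetical names, and the space and line break inside the sample URL are assumptions added for the demo, since the archived example does not preserve them.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Nutch: cleans up hrefs that were
// broken by spaces or line breaks in the page source.
class LinkCleaner {

    // Removes every run of whitespace (spaces, tabs, line breaks) from the href.
    // This is a heuristic: a legitimate space in a URL should be encoded as %20.
    public static String stripWhitespace(String href) {
        return href.replaceAll("\\s+", "");
    }

    // Returns true if the href parses as an absolute URI with a host,
    // i.e. a link that a fetcher could follow.
    public static boolean isFollowable(String href) {
        try {
            URI uri = new URI(href);
            return uri.isAbsolute() && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // A raw space is an illegal character in a URI.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed example of a link broken across a line in the HTML source.
        String raw = "http://www.domain.com/sub/ dynamic.\n  0001.html";
        String cleaned = stripWhitespace(raw);

        System.out.println("raw followable:     " + isFollowable(raw));
        System.out.println("cleaned:            " + cleaned);
        System.out.println("cleaned followable: " + isFollowable(cleaned));
    }
}
```

Stripping all whitespace is only safe when the spaces come from source formatting; if a site really uses unencoded spaces in its paths, encoding them as %20 would be the better repair.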

Re: AW: extract links problem with parse-html plugin

2006-02-20 Thread Andrzej Bialecki
Elwin wrote: No, I didn't try that. I just use the default parser for the plugin. It seems that it works well now. Thx. I often find TagSoup performing better than NekoHTML. In case of some grave HTML errors Neko tends to simply truncate the document, while TagSoup just "keeps on trucki

Re: AW: extract links problem with parse-html plugin

2006-02-20 Thread Elwin
No, I didn't try that. I just use the default parser for the plugin. It seems that it works well now. Thx. 2006/2/20, Andrzej Bialecki <[EMAIL PROTECTED]>: > > Elwin wrote: > > Yes, it's true, although it's not the cause of my problem. > > > > Did you try to use the alternative HTML parser (Tag

Re: AW: extract links problem with parse-html plugin

2006-02-20 Thread Andrzej Bialecki
Elwin wrote: Yes, it's true, although it's not the cause of my problem. Did you try to use the alternative HTML parser (TagSoup) supported by the plugin? You need to set a property "parser.html.impl" to "tagsoup". -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __

Re: AW: extract links problem with parse-html plugin

2006-02-20 Thread Elwin
l is from a Chinese site; however you can just skip those > non-English > > contents and just look at the HTML elements. > > > > 2006/2/17, Guenter, Matthias <[EMAIL PROTECTED]>: > >> Hi Elwin > >> Can you provide samples of not working links and code? And put it i

Re: AW: extract links problem with parse-html plugin

2006-02-20 Thread Piotr Kosiorowski
-Original Message- > From: Elwin [mailto:[EMAIL PROTECTED] > Sent: Friday, 17 February 2006 09:36 > To: nutch-user@lucene.apache.org > Subject: Re: extract links problem with parse-html plugin > > I have written a test class HtmlWrapper and here is some code:

Re: extract links problem with parse-html plugin

2006-02-17 Thread Elwin
t; truncated; > otherwise, no truncation at all. > > > > Kind regards > > Matthias > -Original Message- > From: Elwin [mailto:[EMAIL PROTECTED] > Sent: Friday, 17 February 2006 09:36 > To: nutch-user@lucene.apache.org > Subject: Re: extract links problem wit

AW: extract links problem with parse-html plugin

2006-02-17 Thread Guenter, Matthias
all. Kind regards Matthias -Original Message- From: Elwin [mailto:[EMAIL PROTECTED] Sent: Friday, 17 February 2006 09:36 To: nutch-user@lucene.apache.org Subject: Re: extract links problem with parse-html plugin I have written a test class HtmlWrapper and here is some c

Re: extract links problem with parse-html plugin

2006-02-17 Thread Elwin
in [mailto:[EMAIL PROTECTED] > Sent: Fri 17.02.2006 08:51 > To: nutch-user@lucene.apache.org > Subject: extract links problem with parse-html plugin > > It seems that the parse-html plugin may not process many pages well, > because > I have found that the plugin can't extract al

Re: AW: extract links problem with parse-html plugin

2006-02-17 Thread Poettgen
1 > To: nutch-user@lucene.apache.org > Subject: extract links problem with parse-html plugin > > It seems that the parse-html plugin may not process many pages well, because > I have found that the plugin can't extract all valid links in a page when I > test it in my code. >

Re: extract links problem with parse-html plugin

2006-02-17 Thread Poettgen
I found the same. On my site, the HTML source is as large as 160 kB per page. The parser definitely has problems here (whether or not JavaScript is used on a page). Before deciding on Nutch, I tested the Java/Lucene-based open source solution Oxygen ( http://sourceforge.net/projects/oxyu

AW: extract links problem with parse-html plugin

2006-02-17 Thread Guenter, Matthias
Hi Elwin Can you provide samples of not working links and code? And put it into JIRA? Kind regards Matthias -Original Message- From: Elwin [mailto:[EMAIL PROTECTED] Sent: Fri 17.02.2006 08:51 To: nutch-user@lucene.apache.org Subject: extract links problem with parse-html plugin

extract links problem with parse-html plugin

2006-02-16 Thread Elwin
It seems that the parse-html plugin may not process many pages well, because I have found that the plugin can't extract all valid links in a page when I test it in my code. I guess that it may be caused by the style of an HTML page? When I "view source" of an HTML page I tried to parse, I saw that som