I will try it. Many thanks.
2006/2/20, Andrzej Bialecki <[EMAIL PROTECTED]>:
>
> Elwin wrote:
> > No, I didn't try that. I just use the default parser for the plugin.
> > It seems that it works well now.
> > Thx.
> >
>
> I often find TagSoup performing better than NekoHTML. In case of some
> grave HTML errors Neko tends to simply truncate the document, while
> TagSoup just "keeps on truckin'".
>
But I also found a problem. Some links extracted from a page may have
internal spaces, like "http://www.domain.com/sub/dynamic.0001.html".
I guess this is caused by the style file of the page. The link can be
extracted, but in fact it's a wrong link, which can't be followed further.
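For illustration, a minimal sketch of a workaround (a hypothetical helper,
not part of Nutch) that strips such whitespace from an extracted href before
the link is followed:

import java.net.MalformedURLException;
import java.net.URL;

public class LinkCleaner {
    /** Remove all whitespace from an extracted href and validate the result.
     *  Whitespace inside an href usually comes from line breaks in the markup;
     *  stripping it is a heuristic, not guaranteed to recover the intended URL. */
    public static URL clean(String rawHref) throws MalformedURLException {
        String compact = rawHref.replaceAll("\\s+", "");
        return new URL(compact); // still throws if the cleaned string is no URL
    }

    public static void main(String[] args) throws Exception {
        // hypothetical example of an href split across lines in the source HTML
        System.out.println(clean("http://www.domain.com/sub/dyn\namic.0001.html"));
    }
}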
I often find TagSoup performing better than NekoHTML. In case of some
grave HTML errors Neko tends to simply truncate the document, while
TagSoup just "keeps on truckin'".
Elwin wrote:
> Yes, it's true, although it's not the cause of my problem.

Did you try to use the alternative HTML parser (TagSoup) supported by
the plugin? You need to set a property "parser.html.impl" to "tagsoup".
--
Best regards,
Andrzej Bialecki <><
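For reference, the setting above goes into conf/nutch-site.xml; a minimal
sketch (the description text is paraphrased, not quoted from
nutch-default.xml):

<property>
  <name>parser.html.impl</name>
  <value>tagsoup</value>
  <description>HTML parser implementation for parse-html:
  "neko" (the default) or "tagsoup".</description>
</property>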
The html is from a Chinese site; however you can just skip those non-English
contents and just see the html elements.

2006/2/17, Guenter, Matthias <[EMAIL PROTECTED]>:
> Hi Elwin
> Can you provide samples of not working links and code? And put it into JIRA?
> -----Original Message-----
> From: Elwin [mailto:[EMAIL PROTECTED]
> Sent: Friday, 17 February 2006 09:36
> To: nutch-user@lucene.apache.org
> Subject: Re: extract links problem with parse-html plugin
>
> I have written a test class HtmlWrapper and here is some code:
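A hypothetical sketch of what such a test class might look like (the name
HtmlWrapper is from the mail above; the body is illustrative only, not the
original code), using NekoHTML's DOMParser, the plugin's default parser:

import java.io.StringReader;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class HtmlWrapper {
    /** Parse HTML with NekoHTML and print the href of every anchor element. */
    public static void printLinks(String html) throws Exception {
        DOMParser parser = new DOMParser();
        parser.parse(new InputSource(new StringReader(html)));
        Document doc = parser.getDocument();
        // NekoHTML reports HTML element names in upper case by default
        NodeList anchors = doc.getElementsByTagName("A");
        for (int i = 0; i < anchors.getLength(); i++) {
            Node href = anchors.item(i).getAttributes().getNamedItem("href");
            if (href != null)
                System.out.println(href.getNodeValue());
        }
    }

    public static void main(String[] args) throws Exception {
        printLinks("<html><body><a href='http://example.com/x'>x</a></body></html>");
    }
}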
> [...] truncated; otherwise, no truncation at all.
>
> Kind regards
>
> Matthias
I have observed the same.
On my site the HTML source is roughly 160 kByte per page.
The parser definitely has problems here (whether or not Javascript is used
on a page).
Before deciding on Nutch I tested the Java/Lucene based open source
solution Oxygen ( http://sourceforge.net/projects/oxyu
Hi Elwin
Can you provide samples of the non-working links, and your code? And put it
into JIRA?
Kind regards
Matthias
-----Original Message-----
From: Elwin [mailto:[EMAIL PROTECTED]
Sent: Fri 17.02.2006 08:51
To: nutch-user@lucene.apache.org
Subject: extract links problem with parse-html plugin
It seems that the parse-html plugin may not process many pages well, because
I have found that the plugin can't extract all valid links in a page when I
test it in my code.
I guess that it may be caused by the style of an html page? When I "view
source" of an html page I used to parse, I saw that som[...]