t; truncated;
> otherwise, no truncation at all.
>
>
>
> Kind regards
>
> Matthias
> -----Original Message-----
> From: Elwin [mailto:[EMAIL PROTECTED]
> Sent: Friday, 17 February 2006 09:36
> To: nutch-user@lucene.apache.org
> Subject: Re: extract links problem wit
I have written a test class, HtmlWrapper, and here is some code:

HtmlWrapper wrapper = new HtmlWrapper();
Content c = getHttpContent("http://blog.sina.com.cn/lm/hot/index.html");
String temp = new String(c.getContent());
System.out.println(temp);
wrapper.parseHttpContent(c); // get all outlinks
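For illustration only (this is not Nutch's actual fetcher code, and the class name is hypothetical), here is a minimal sketch of how a fetcher that honors a byte limit ends up truncating a large page. Any links that appear in the HTML after the cutoff never reach the parser, which would explain outlinks going missing:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of limit-based truncation, similar in spirit to
// what a crawler with a configured content-size limit would do.
public class TruncateSketch {
    // Read at most `limit` bytes from the stream, discarding the rest.
    static byte[] readAtMost(InputStream in, int limit) throws IOException {
        byte[] buf = new byte[limit];
        int total = 0;
        while (total < limit) {
            int n = in.read(buf, total, limit - total);
            if (n < 0) break; // end of stream before the limit was hit
            total += n;
        }
        byte[] out = new byte[total];
        System.arraycopy(buf, 0, out, 0, total);
        return out;
    }

    public static void main(String[] args) throws IOException {
        byte[] page = new byte[160 * 1024]; // a 160 kB page, as in the report
        byte[] fetched = readAtMost(new ByteArrayInputStream(page), 65536);
        // Everything past the limit is silently dropped.
        System.out.println(fetched.length); // prints 65536
    }
}
```

Comparing the byte count of the fetched content against the raw page size is a quick way to confirm whether truncation, rather than the parser itself, is losing the links.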
I observed the same thing.
On my site, the HTML source is roughly 160 kB per page.
The parser definitely has problems here (whether or not JavaScript is
used on the page).
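If 160 kB pages are being cut off, a likely cause is Nutch's http.content.limit property, whose description ("...truncated; otherwise, no truncation at all") matches the wording quoted at the top of this thread; its default is 65536 bytes, well below 160 kB. A hedged sketch of an override in conf/nutch-site.xml (verify the property name and default against the nutch-default.xml shipped with your version):

```xml
<!-- conf/nutch-site.xml: raise the download size limit so large pages
     are fetched in full. The value 262144 (256 kB) is an illustrative
     choice; -1 disables truncation entirely. -->
<property>
  <name>http.content.limit</name>
  <value>262144</value>
</property>
```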
Before deciding on Nutch, I tested the Java/Lucene-based open source
solution Oxygen ( http://sourceforge.net/projects/oxyu