Re: extract links problem with parse-html plugin

2006-02-17 Thread Elwin
truncated; > otherwise, no truncation at all. > > > > Kind regards > > Matthias > -----Original Message----- > From: Elwin [mailto:[EMAIL PROTECTED] > Sent: Friday, 17 February 2006 09:36 > To: nutch-user@lucene.apache.org > Subject: Re: extract links problem wit

Re: extract links problem with parse-html plugin

2006-02-17 Thread Elwin
I have written a test class, HtmlWrapper, and here is some of the code:

    HtmlWrapper wrapper = new HtmlWrapper();
    Content c = getHttpContent("http://blog.sina.com.cn/lm/hot/index.html");
    String temp = new String(c.getContent());
    System.out.println(temp);
    wrapper.parseHttpContent(c); // get all outlinks
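
To separate a parser problem from a truncation problem, it can help to compare the links found in the full page with the links found in a cut-off copy. The following is only a minimal sketch that does not use Nutch at all: the class OutlinkSketch, the methods extractLinks and truncate, and the a.example URLs are hypothetical names made up for illustration, and the regular expression is a rough heuristic rather than the parse-html plugin's actual logic. It also assumes the limit is counted in characters, for simplicity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OutlinkSketch {
    // Rough heuristic: quoted href values inside <a ...> tags.
    // This is NOT a real HTML parser and not what parse-html does.
    private static final Pattern HREF = Pattern.compile(
        "<a\\s[^>]*?href\\s*=\\s*([\"'])(.*?)\\1",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Collect every href value that the pattern finds, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    // Simulate a content limit by cutting the page after 'limit' characters
    // (assumption: characters, not bytes, to keep the sketch short).
    static String truncate(String html, int limit) {
        return html.length() <= limit ? html : html.substring(0, limit);
    }

    public static void main(String[] args) {
        StringBuilder page = new StringBuilder();
        page.append("<a href=\"http://a.example/1\">one</a>");
        for (int i = 0; i < 100; i++) {
            page.append('x'); // filler standing in for a large page body
        }
        page.append("<a href='http://a.example/2'>two</a>");

        String full = page.toString();
        String cut = truncate(full, 40);

        System.out.println("links in full page: " + extractLinks(full));
        System.out.println("links in truncated page: " + extractLinks(cut));
    }
}
```

If the truncated copy yields fewer links than the full page, the missing outlinks are more likely lost to the content limit than to the parser itself.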

Re: extract links problem with parse-html plugin

2006-02-17 Thread Poettgen
I found the same. On my site the HTML source is about 160 kB per page. The parser definitely has problems here (whether or not JavaScript is used on a page). Before deciding on Nutch I tested the Java/Lucene-based open source solution Oxygen ( http://sourceforge.net/projects/oxyu