I have written a test class HtmlWrapper and here is some code:

    HtmlWrapper wrapper = new HtmlWrapper();
    Content c = getHttpContent("http://blog.sina.com.cn/lm/hot/index.html");
    String temp = new String(c.getContent());
    System.out.println(temp);

    wrapper.parseHttpContent(c);

    // get all outlinks into an ArrayList
    ArrayList links = wrapper.getBlogLinks();
    for (int i = 0; i < links.size(); i++) {
        String urlString = (String) links.get(i);
        System.out.println(urlString);
    }

I can only get a few of the links from that page. The URL is from a Chinese site; however, you can skip the non-English content and just look at the HTML elements.
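As a rough sanity check (this is only my own approximation, not how the parse-html plugin works), a small helper can count the <a ... href=...> tags in the raw source with a regex, so the count can be compared with links.size():

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Rough count of <a ... href=...> occurrences in the raw page source.
    // Only an approximation: it ignores <base>, frames, javascript: links, etc.
    static int countAnchors(String html) {
        Pattern p = Pattern.compile(
                "<a\\b[^>]*?href\\s*=\\s*[\"']?([^\"'\\s>]+)",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        int count = 0;
        while (m.find()) {
            System.out.println(m.group(1));
            count++;
        }
        return count;
    }

    // e.g. after the code above:
    System.out.println("anchors in raw source:   " + countAnchors(temp));
    System.out.println("links from HtmlWrapper:  " + links.size());

If the regex count is much larger than links.size(), that would suggest the links are being lost during parsing rather than missing from the page.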
2006/2/17, Guenter, Matthias <[EMAIL PROTECTED]>:
>
> Hi Elwin
> Can you provide samples of the links that are not working, and your code? And
> put it into JIRA?
> Kind regards
> Matthias
>
>
>
> -----Original Message-----
> From: Elwin [mailto:[EMAIL PROTECTED]
> Sent: Fri 17.02.2006 08:51
> To: nutch-user@lucene.apache.org
> Subject: extract links problem with parse-html plugin
>
> It seems that the parse-html plugin may not process many pages well,
> because I have found that it cannot extract all of the valid links on a page
> when I test it in my code.
> I guess it may be caused by the style of the HTML page? When I "view
> source" on a page I was parsing, I saw that some elements in the
> source are split up by unnecessary spaces. This situation is quite
> common on the pages of large portal sites and news sites.
>
> --
> 《盖世豪侠》 won rave reviews and kept TVB's ratings high, yet TVB, pleased
> as it was, still did not give him a leading role. Stephen Chow was no ordinary
> talent; once his comic gift had shown itself, he would not accept being
> sidelined, so he moved to the film industry and shone on the big screen. TVB
> had found its thousand-li horse and then lost it, and could only regret it.
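To make the samples Matthias asked for more concrete: by elements "split up by unnecessary spaces" I mean anchors shaped roughly like the fragment below (the URL is made up for illustration), with the attribute broken across lines and padded with spaces:

    // Made-up sample in the shape I see in the page source:
    String sample =
        "<a\n      href = \"http://example.com/2006/02/entry.html\" target=\"_blank\" >an entry</a>";
    // The rough regex check above still counts this anchor, so comparing its
    // total with links.size() should show whether these are the ones being dropped:
    System.out.println(countAnchors(sample));   // prints the URL, then 1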