Hi folks: What I want to do is split an RSS file into several pages.
As has been discussed here before, I want to fetch an RSS page and index it as several separate documents, so that a search returns each item's information as an individual hit.

My idea is to create a protocol plugin that fetches the RSS page and stores it as several pages, each containing just one ITEM tag. But the unique key of a document is the URL, so how can I store each of them with the ITEM's link tag as the unique key for the document?

So my question is how to implement this in nutch-0.8.x. I've checked the code of the protocol-http plugin, but I can't find the place where a fetched page is stored as a document. I want to split the RSS page into several pages before it is stored, so that it ends up as several documents instead of one.

Can anyone give me some hints? Any reply will be appreciated!

ITEM's structure:

<item>
  <title>Late snowstorm strikes Europe, causing flight delays and traffic chaos (photos)</title>
  <description>Snowstorm sweeps across Europe, causing many flight delays. On January 24, several airliners wait at Stuttgart Airport in Germany to have ice and snow removed from their fuselages. On January 24, workers clear snow from the runway at Munich Airport in southern Germany. According to reports, the late snowstorm swept for two days in a row across... </description>
  <link>http://news.sohu.com/20070125/n247833568.shtml</link>
  <category>Sohu Focus Photo News</category>
  <author>[EMAIL PROTECTED]</author>
  <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>
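
To make the splitting step concrete, here is a minimal sketch in plain Java, assuming the feed has already been fetched into a string. It uses only the JDK's XML classes and no Nutch API, so it does not answer where Nutch stores a page; the names RssItemSplitter and splitByLink are hypothetical. It produces one small XML fragment per ITEM, keyed by the text of that ITEM's link tag, which is the key I would like each document to have.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: splits one RSS feed into per-item fragments.
final class RssItemSplitter {

    // Returns a map from each item's <link> text (the intended unique key)
    // to the serialized XML of that single <item>, in feed order.
    static Map<String, String> splitByLink(String rssXml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document feed = builder.parse(new InputSource(new StringReader(rssXml)));

        // Identity transformer, used only to turn a DOM node back into text.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        Map<String, String> itemsByLink = new LinkedHashMap<>();
        NodeList items = feed.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            NodeList links = item.getElementsByTagName("link");
            if (links.getLength() == 0) {
                continue; // no link means no usable key, so the item is skipped
            }
            String key = links.item(0).getTextContent().trim();
            if (key.isEmpty()) {
                continue;
            }
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(item), new StreamResult(out));
            itemsByLink.put(key, out.toString());
        }
        return itemsByLink;
    }
}
```

Each map entry would then correspond to one document: the key is the item's link URL and the value is the content for that document. Items without a link are skipped here, and two items with the same link would overwrite each other; both choices are assumptions of this sketch.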