Doug Cutting wrote:
> Renaud Richardet wrote:
>> The usecase is that you index RSS-feeds, but your users can search
>> each feed-entry as a single document. Does it makes sense?
>
> But each feed item also contains a link whose content will be indexed
> and that's generally a superset of the item.

Agreed.

> So should there be two urls indexed per item?

I don't think so.

> In many cases, the best thing to do is to index only the linked page,
> not the feed item at all. In some (rare?) cases, there might be items
> without a link, whose only content is directly in the feed, or where
> the content in the feed is complementary to that in the linked page.
> In these cases it might be useful to combine the two (the feed item
> and the linked content), indexing both. The proposed change might
> permit that. Is that the case you're concerned about?

I see. I was thinking that I could index the feed items without having
to fetch them individually.
More fundamentally, I want to index only the blog-entry text, and not
the elements around it (header, menus, ads, ...), so as to improve the
search results.

Here is my case. The branches marked (*) are what the proposed changes
would allow me to do:

1) parse feeds:
  for each (feedentry : feed) do
  |
  | if (full-text entries) then
  | | index each feed entry as a single document; blog header and
  | | menus are not indexed.
* | else
  | | create a "special outlink" for each feed entry, which includes
  | | metadata (content, time, etc)
  | endif
  done

2) on the next fetch loop:
  for each (link) do
  |
  | if (this is a normal link) then
  | | fetch it and index it normally
  | else if (this link comes from an already indexed feed entry) then
  | | end, do not fetch it
* | else if (this is a "special outlink") then
  | | guess which DOM nodes hold the post content
  | | index it; blog header and menus are not indexed.
  | endif
  done

Thanks,
Renaud

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
