Doug Cutting wrote:
Renaud Richardet wrote:
The use case is that you index RSS feeds, but your users can search each feed entry as a single document. Does that make sense?

But each feed item also contains a link whose content will be indexed and that's generally a superset of the item.
Agreed
So should there be two urls indexed per item?
I don't think so
In many cases, the best thing to do is to index only the linked page, not the feed item at all. In some (rare?) cases, there might be items without a link, whose only content is directly in the feed, or where the content in the feed is complementary to that in the linked page. In these cases it might be useful to combine the two (the feed item and the linked content), indexing both. The proposed change might permit that. Is that the case you're concerned about?
I see. I was thinking that I could index the feed items without having to fetch them individually.

More fundamentally, I want to index only the blog-entry text, and not the elements around it (header, menus, ads, ...), so as to improve the search results.

Here's my case. The steps marked (*) are what the proposed changes would allow me to do:

1) parse feeds:

for each (feedentry : feed) do
|
|  if (full-text entries) then
|  |  index each feed entry as a single document; blog header, menus are not indexed. *
|  else
|  |  create a "special outlink" for each feed entry, which includes metadata (content, time, etc.)
|  endif
|
done
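
Step 1 above could be sketched in Python roughly as follows. This is only an illustration: FeedEntry, SpecialOutlink, parse_feed and the is_full_text flag are made-up names, not Nutch APIs, and how to detect a full-text entry is left open.

```python
from dataclasses import dataclass


@dataclass
class FeedEntry:
    # Hypothetical shape of one parsed feed item.
    link: str
    published: str
    content: str        # full post text, or only a summary
    is_full_text: bool  # assumption: decided elsewhere, e.g. by the feed parser


@dataclass
class SpecialOutlink:
    # Hypothetical outlink that carries the feed item's metadata along.
    url: str
    metadata: dict


def parse_feed(entries):
    """Return (documents to index now, special outlinks for the next fetch loop)."""
    documents, outlinks = [], []
    for entry in entries:
        if entry.is_full_text:
            # Index the entry itself; blog header and menus never enter the index.
            documents.append(
                {"url": entry.link, "text": entry.content, "time": entry.published}
            )
        else:
            # Defer to the next fetch loop, keeping what the feed already told us.
            outlinks.append(
                SpecialOutlink(
                    entry.link,
                    {"content": entry.content, "time": entry.published},
                )
            )
    return documents, outlinks
```

Full-text entries end up in the first list and never need a fetch; everything else comes back as a special outlink.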

2) on the next fetch loop:

for each (link) do
|
|  if (this is a normal link) then
|  |  fetch it and index it normally
|  else if (this link comes from an already indexed feed entry) then
|  |  stop, do not fetch it *
|  else if (this is a "special outlink") then
|  |  guess which DOM nodes hold the post content
|  |  index it; blog header, menus are not indexed.
|  endif
|
done
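
The per-link decision in the loop above could look something like this in Python. Again just a sketch with made-up names (plan_fetch and the two lookup arguments are not Nutch APIs); it assumes we remember which URLs were already indexed from a feed and which ones are special outlinks.

```python
def plan_fetch(url, indexed_from_feed, special_outlinks):
    """Decide what the fetch loop should do with one link.

    indexed_from_feed: set of URLs whose feed entry was already indexed (hypothetical)
    special_outlinks:  dict mapping URL -> feed metadata (hypothetical)
    """
    if url in indexed_from_feed:
        # The full text came from the feed, so fetching would only add boilerplate.
        return "skip"
    if url in special_outlinks:
        # Fetch the page, but index only the DOM nodes guessed to hold the post.
        return "fetch_and_extract_post"
    # Anything else is a normal link.
    return "fetch_and_index"
```

A "normal link" here is simply one that is in neither collection, so it is handled last.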


Thanks,
Renaud
