Doug Cutting wrote:
> Renaud Richardet wrote:
>> The use case is that you index RSS feeds, but your users can search
>> each feed entry as a single document. Does it make sense?
>
> But each feed item also contains a link whose content will be indexed
> and that's generally a superset of the item.

Agreed.

> So should there be two urls indexed per item?

I don't think so.

> In many cases, the best thing to do is to index only the linked page,
> not the feed item at all. In some (rare?) cases, there might be items
> without a link, whose only content is directly in the feed, or where
> the content in the feed is complementary to that in the linked page.
> In these cases it might be useful to combine the two (the feed item
> and the linked content), indexing both. The proposed change might
> permit that. Is that the case you're concerned about?

I see. I was thinking that I could index the feed items without having
to fetch them individually.
More fundamentally, I want to index only the blog-entry text, and not
the elements around it (header, menus, ads, ...), so as to improve the
search results.
Here's my case. The steps marked with (*) are the ones the proposed
changes would allow me to do:
1) parse feeds:
for each (feedentry : feed) do
|
| if (full-text entries) then
| | index each feed entry as a single document; blog header, menus are not indexed. *
| else
| | create a "special outlink" for each feed entry, which includes metadata (content, time, etc)
| endif
|
done
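
To make step 1 concrete, here is a rough Python sketch. It is not Nutch code: FeedEntry, Outlink, parse_feed and the "from_feed" metadata key are made-up names, and I am assuming the feed parser can already tell whether an entry carries its full text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FeedEntry:
    """Hypothetical parsed feed item (not a Nutch class)."""
    link: str
    summary: str
    published: str
    full_text: Optional[str] = None  # None when the feed carries only an excerpt


@dataclass
class Outlink:
    """Hypothetical "special outlink": a URL plus metadata taken from the feed."""
    url: str
    metadata: dict = field(default_factory=dict)


def parse_feed(entries):
    """Return (documents to index right away, special outlinks to fetch later)."""
    documents, outlinks = [], []
    for entry in entries:
        if entry.full_text is not None:
            # Full-text entry: index the entry itself as a single document,
            # so blog header and menus never reach the index.
            documents.append({
                "url": entry.link,
                "content": entry.full_text,
                "time": entry.published,
            })
        else:
            # Excerpt only: remember the feed metadata on the outlink so a
            # later fetch can treat the linked page specially.
            outlinks.append(Outlink(entry.link, {
                "from_feed": True,
                "content": entry.summary,
                "time": entry.published,
            }))
    return documents, outlinks
```

Calling parse_feed on a feed with one full-text entry and one excerpt-only entry gives one document to index now and one special outlink for the next fetch loop.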
2) on a next fetch loop:
for each (link) do
|
| if (this is a normal link) then
| | fetch it and index it normally
| else if (this link comes from an already indexed feed entry) then
| | stop here; do not fetch it *
| else if (this is a "special outlink") then
| | guess which DOM nodes hold the post content
| | index it; blog header, menus are not indexed.
| endif
|
done
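
The decision in step 2 could look roughly like the Python sketch below. Again, this is only an illustration: classify_link, the action names and the "from_feed" metadata key are hypothetical, and the part that guesses which DOM nodes hold the post content is left out. The checks are in a different order than in the pseudocode, but a "normal link" is simply one that matches neither special case, so the result is the same.

```python
def classify_link(url, metadata, indexed_feed_urls):
    """Decide what the fetch loop should do with one link.

    url               -- the link to consider
    metadata          -- dict attached to the outlink, or None for a normal link
    indexed_feed_urls -- set of URLs whose feed entry was already indexed in full
    """
    if url in indexed_feed_urls:
        # The feed entry was already indexed as a document: do not fetch.
        return "skip"
    if metadata and metadata.get("from_feed"):
        # "Special outlink": fetch the page, then index only the post content.
        return "fetch_and_extract_post"
    # Anything else is a normal link.
    return "fetch_and_index_normally"


def plan_fetch(links, indexed_feed_urls):
    """Map each (url, metadata) pair to its action."""
    return {url: classify_link(url, meta, indexed_feed_urls) for url, meta in links}
```

For example, plan_fetch([("http://a.example/1", None)], {"http://a.example/1"}) marks that URL as "skip", because its feed entry was already indexed.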
Thanks,
Renaud