Re: RSS-fecter and index individul-how can i realize this function

Renaud Richardet Tue, 06 Feb 2007 15:17:31 -0800

Doug Cutting wrote:

Renaud Richardet wrote:
The use case is that you index RSS feeds, but your users can search each feed entry as a single document. Does it make sense?
But each feed item also contains a link whose content will be indexed, and that's generally a superset of the item.

Agreed

So should there be two urls indexed per item?

I don't think so

In many cases, the best thing to do is to index only the linked page, not the feed item at all. In some (rare?) cases, there might be items without a link, whose only content is directly in the feed, or where the content in the feed is complementary to that in the linked page. In these cases it might be useful to combine the two (the feed item and the linked content), indexing both. The proposed change might permit that. Is that the case you're concerned about?

I see. I was thinking that I could index the feed items without having to fetch them individually.

More fundamentally, I want to index only the blog-entry text, and not the elements around it (header, menus, ads, ...), so as to improve the search results.


Here's my case. The proposed changes would allow me to do the steps marked with (*):

1) parse feeds:

for each (feedentry : feed) do
|
|  if (full-text entries) then
|    |  index each feed entry as a single document; blog header, menus are not indexed. *
|  else
|    |  create a "special outlink" for each feed entry, which includes metadata (content, time, etc.)
|  endif
|
done
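
A minimal Python sketch of step 1, just to make the branching concrete. Everything in it is an assumption for illustration: FeedEntry, SpecialOutlink, parse_feed and the full_text flag are hypothetical names, not Nutch classes, and how to decide that a feed carries full-text entries is left open.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FeedEntry:
    """One item of an already-parsed feed (hypothetical structure)."""
    link: str
    content: str
    published: str
    full_text: bool  # assumed flag: True when the feed carries the whole post


@dataclass
class SpecialOutlink:
    """Outlink that keeps the feed metadata of the entry it came from."""
    url: str
    metadata: Dict[str, str] = field(default_factory=dict)


def parse_feed(
    entries: List[FeedEntry],
) -> Tuple[List[Dict[str, str]], List[SpecialOutlink]]:
    """Split feed entries into documents to index now and outlinks to fetch later."""
    documents: List[Dict[str, str]] = []
    outlinks: List[SpecialOutlink] = []
    for entry in entries:
        if entry.full_text:
            # Index the entry itself as a single document; no page chrome involved.
            documents.append(
                {"url": entry.link, "text": entry.content, "time": entry.published}
            )
        else:
            # Defer to the next fetch loop, carrying the feed metadata along.
            outlinks.append(
                SpecialOutlink(
                    entry.link,
                    {"content": entry.content, "time": entry.published},
                )
            )
    return documents, outlinks
```

Returning the two lists separately keeps the "index now" and "fetch later" decisions visible to whatever drives the next loop.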

2) on a next fetch loop:

for each (link) do
|
|  if (this is a normal link) then
|    |  fetch it and index it normally
|  else if (this link comes from an already indexed feed entry) then
|    |  end, do not fetch it *
|  else if (this is a "special outlink") then
|    |  guess which DOM nodes hold the post content
|    |  index it; blog header, menus are not indexed.
|  endif
|
done
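
The per-link decision in step 2 could be sketched in Python as below. Again, this is only an illustration under assumptions: Action and plan_fetch are hypothetical names, the two sets of URLs are assumed to have been collected during step 1, and the DOM guessing itself is not shown. The sketch checks the two special cases first and treats "normal link" as the fallback, which should give the same result as the order written above.

```python
from enum import Enum
from typing import Dict, Iterable, Set


class Action(Enum):
    FETCH_AND_INDEX = "fetch and index normally"
    SKIP = "do not fetch"
    FETCH_AND_EXTRACT = "fetch, then index only the guessed post content"


def plan_fetch(
    links: Iterable[str],
    indexed_feed_urls: Set[str],
    special_outlink_urls: Set[str],
) -> Dict[str, Action]:
    """Decide what the next fetch loop should do with each link."""
    plan: Dict[str, Action] = {}
    for url in links:
        if url in indexed_feed_urls:
            # The full-text feed entry was already indexed in step 1.
            plan[url] = Action.SKIP
        elif url in special_outlink_urls:
            # Fetch the page, but index only the nodes guessed to hold the post.
            plan[url] = Action.FETCH_AND_EXTRACT
        else:
            # Anything else is a normal link.
            plan[url] = Action.FETCH_AND_INDEX
    return plan
```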


Thanks,
Renaud
