Re: [Nutch-dev] RSS-fecter and index individul-how can i realize this function

Renaud Richardet Tue, 06 Feb 2007 15:18:27 -0800

Doug Cutting wrote:
> Renaud Richardet wrote:
>> The use case is that you index RSS feeds, but your users can search 
>> each feed entry as a single document. Does that make sense?
>
> But each feed item also contains a link whose content will be indexed 
> and that's generally a superset of the item.  
Agreed
> So should there be two urls indexed per item?  
I don't think so
> In many cases, the best thing to do is to index only the linked page, 
> not the feed item at all.  In some (rare?) cases, there might be items 
> without a link, whose only content is directly in the feed, or where 
> the content in the feed is complementary to that in the linked page.  
> In these cases it might be useful to combine the two (the feed item 
> and the linked content), indexing both.  The proposed change might 
> permit that.  Is that the case you're concerned about?
I see. I was thinking that I could index the feed items without having 
to fetch them individually.


More fundamentally, I want to index only the blog-entry text, and not 
the elements around it (header, menus, ads, ...), so as to improve the 
search results.

Here's my case. The steps marked (*) are what the proposed changes would 
allow me to do:

1) parse feeds:

for each (feedentry : feed) do
|
|  if (full-text entries) then
|   |  index each feed entry as a single document; blog header and
|   |  menus are not indexed. *
|  else
|   |  create a "special outlink" for each feed entry, which includes
|   |  metadata (content, time, etc.)
|  endif
|
done
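
To make step 1 concrete, here is a minimal Python sketch of the same branching. It is not Nutch code: FeedEntry, ParseResult, parse_feed and the full_text flag are all hypothetical names I made up for illustration, and how a feed is recognized as "full-text" is left as an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class FeedEntry:
    # Hypothetical container for one feed item (not a Nutch class).
    url: str
    title: str
    content: str      # text carried in the feed itself
    published: str
    full_text: bool   # assumption: something upstream decides this


@dataclass
class ParseResult:
    documents: list = field(default_factory=list)  # entries indexed directly
    outlinks: list = field(default_factory=list)   # "special outlinks" for later


def parse_feed(entries):
    """Split feed entries into directly indexable documents and special outlinks."""
    result = ParseResult()
    for entry in entries:
        if entry.full_text:
            # Index the entry itself; no blog header or menus are involved.
            result.documents.append(
                {"url": entry.url, "title": entry.title, "text": entry.content}
            )
        else:
            # Defer to the next fetch loop, carrying the feed metadata along.
            result.outlinks.append(
                {
                    "url": entry.url,
                    "special": True,
                    "metadata": {"content": entry.content, "time": entry.published},
                }
            )
    return result
```

The point of keeping the metadata on the outlink is that the later fetch can still use the feed's own content and timestamp when it indexes the linked page.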

2) on the next fetch loop:

for each (link) do
|
|  if (this is a normal link) then
|    |  fetch it and index it normally
|  else if (this link comes from an already indexed feed entry) then
|    |  stop; do not fetch it *
|  else if (this is a "special outlink") then
|    |  guess which DOM nodes hold the post content
|    |  index it; blog header and menus are not indexed.
|  endif
|
done
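
The per-link decision in step 2 could be sketched like this in Python. Again, this is only an illustration under my own assumptions: Action and decide are hypothetical names, and the two sets of URLs are assumed to have been recorded while parsing the feeds. I check the "already indexed" case first so that a "normal link" simply means "neither of the other two".

```python
from enum import Enum


class Action(Enum):
    # Hypothetical labels for the three outcomes in the loop above.
    FETCH_AND_INDEX = "fetch and index normally"
    SKIP = "do not fetch"
    FETCH_AND_EXTRACT_POST = "fetch, then index only the post content"


def decide(url, indexed_entry_urls, special_outlink_urls):
    """Pick what to do with one link during the next fetch loop."""
    if url in indexed_entry_urls:
        # The feed entry was already indexed as a full-text document.
        return Action.SKIP
    if url in special_outlink_urls:
        # Fetch the page, but index only the nodes that hold the post.
        return Action.FETCH_AND_EXTRACT_POST
    return Action.FETCH_AND_INDEX


if __name__ == "__main__":
    indexed = {"http://blog.example/full-post"}
    special = {"http://blog.example/summary-post"}
    for link in sorted(indexed | special | {"http://example.org/page"}):
        print(link, "->", decide(link, indexed, special).value)
```

Guessing which DOM nodes hold the post content is the hard part and is deliberately left out of this sketch.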


Thanks,
Renaud

