Re: RSS-fecter and index individul-how can i realize this function

mirkes Wed, 03 Dec 2008 07:25:07 -0800

Where can I find Scott's solution? I am trying to do exactly what Scott did,
but I cannot work out how to index the items separately.
Please, can anybody help me?


Many thanks

Miro


sdeck wrote:
> 
> So, here is what I do for RSS Feeds.
> 
> I parse the RSS, and for each outlink, I create the outlink object and set
> inside the anchor text for each outlink a well-formed XML string. It
> contains the pub date, description, etc. Now, this is only because I was
> hacking the outlink to just use its anchor text, but you could always
> just create a new MetaData object for use with an outlink. So, the next
> time that URL is called up, and you then get an HTML parser, you
> could look at the outlink's metadata and say, hey, look, you came from an
> RSS feed. So, I can either just use your stored metadata and not parse the
> HTML, or I could combine your metadata with what comes from the HTML,
> etc.
> I have found that to be the best solution.
> 
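
The approach Scott describes can be sketched roughly as follows. This is only
an illustration in plain Python, not Nutch code: FeedOutlink,
outlinks_from_rss, the "from_feed" key and SAMPLE_RSS are hypothetical names
made up for this example, and the real Nutch Outlink and metadata classes
look different.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FeedOutlink:
    """Hypothetical stand-in for an outlink that carries feed metadata."""
    url: str
    anchor: str
    metadata: Dict[str, str] = field(default_factory=dict)


def outlinks_from_rss(rss_xml: str) -> List[FeedOutlink]:
    """Create one outlink per <item>, keeping the item's fields as metadata."""
    root = ET.fromstring(rss_xml)
    outlinks = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue  # an item without a link cannot become an outlink
        # Mark the outlink so a later HTML parse can tell it came from a feed.
        meta = {"from_feed": "true"}
        for tag in ("pubDate", "description"):
            value = item.findtext(tag)
            if value:
                meta[tag] = value.strip()
        title = (item.findtext("title") or "").strip()
        outlinks.append(FeedOutlink(url=link, anchor=title, metadata=meta))
    return outlinks


# Made-up sample feed, used only to exercise the sketch.
SAMPLE_RSS = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>First post</title><link>http://example.com/1</link>
<pubDate>Wed, 03 Dec 2008 07:25:07 -0800</pubDate>
<description>Summary of the first post.</description></item>
<item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""

if __name__ == "__main__":
    for outlink in outlinks_from_rss(SAMPLE_RSS):
        print(outlink.url, outlink.metadata)
```

When the linked URL is fetched later, the stored metadata travels with it, so
the HTML parsing step can either reuse it as-is or merge it with whatever it
extracts from the page.
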
> Also, when I parse the RSS feed, I set a meta tag called "noindex", so in
> my basic indexer, if that is in there, I do not include the RSS feed page
> in the Lucene index.
> 
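
The "noindex" check could look something like the sketch below. It assumes
each parsed page is a plain dict with "url" and "metadata" keys, which is a
simplification invented for this example and not the Nutch or Lucene API;
pages_to_index is likewise a hypothetical name.

```python
def pages_to_index(parsed_pages):
    """Return only the pages whose metadata does not carry a 'noindex' key."""
    return [
        page for page in parsed_pages
        if "noindex" not in page.get("metadata", {})
    ]


if __name__ == "__main__":
    pages = [
        # The feed page itself is flagged while the RSS is parsed.
        {"url": "http://example.com/feed.rss", "metadata": {"noindex": "true"}},
        # The pages behind the items are not flagged, so they get indexed.
        {"url": "http://example.com/1", "metadata": {"from_feed": "true"}},
    ]
    for page in pages_to_index(pages):
        print(page["url"])
```
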
> Scott
> 
> 
> 
> 
> Doug Cutting wrote:
>> 
>> Chris Mattmann wrote:
>>> Got it. So, the logic behind this is, why bother waiting until the
>>> following fetch to parse (and create ParseData objects from) the RSS
>>> items out of the feed. Okay, I get it, assuming that the RSS feed has
>>> *all* of the RSS metadata in it. However, it's perfectly acceptable to
>>> have feeds that simply have a title, description, and link in it.
>> 
>> Almost.  The feed may have less than the referenced page, but it's also 
>> a lot easier to parse, since the link could be an anchor within a large 
>> page, or could be a page that has lots of navigation links, spam 
>> comments, etc.  So feed entries are generally much more precise than the 
>> pages they reference, and may make for a higher-quality search
>> experience.
>> 
>>> I guess this is still valuable metadata information to have, however,
>>> the only caveat is that the implication of the proposed change is:
>>> 
>>> 1. We won't have cached copies, or fetched copies of the Content
>>> represented by the item links. Therefore, in this model, we won't be
>>> able to pull up a Nutch cache of the page corresponding to the RSS
>>> item, because we are circumventing the fetch step
>> 
>> Good point.  We indeed wouldn't have these URLs in the cache.
>> 
>>> 2. It sounds like a pretty fundamental API shift in Nutch, to support a
>>> single type of content, RSS. Even if there are more content types that
>>> follow this model, as Doug and Renaud both pointed out, there aren't a
>>> multitude of them (perhaps archive files, but can you think of any
>>> others)?
>> 
>> Also true.  On the other hand, Nutch provides 98% of an RSS search 
>> engine.  It'd be a shame to have to re-invent everything else and it 
>> would be great if Nutch could evolve to support RSS well.
>> 
>> Could image search also benefit from this?  One could generate a 
>> Parse for each image on a page whose text was from the page.  Product 
>> search too, perhaps.
>> 
>>> The other main thing that comes to mind about this for me is it
>>> prevents the fetched Content for the RSS items from being able to
>>> provide useful metadata, in the sense that it doesn't explicitly fetch
>>> the content. What if we wanted to apply some super cool metadata
>>> extractor X that used word-stemming, HTML design analysis, and other
>>> techniques to extract metadata from the content pointed to by an RSS
>>> item link? In the proposed model, we assume that the RSS xml item tag
>>> already contains all necessary metadata for indexing, which in my mind,
>>> limits the model. Does what I am saying make sense? I'm not shooting
>>> down the issue, I'm just trying to brainstorm a bit here about the
>>> issue.
>> 
>> Sure, the RSS feed may contain less than the page it references, but 
>> that might be all that one wishes to index.  Otherwise, if, e.g., a blog 
>> includes titles from other recent posts you're going to get lots of 
>> false positives.  Ideally Nutch should support various options: 
>> searching the feed only, searching the referenced page only, or perhaps 
>> searching both.
>> 
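
The three options Doug lists (feed only, referenced page only, or both) could
be modelled as a simple switch, roughly as sketched here. IndexSource and
select_index_text are hypothetical names for this illustration, and nothing
like this setting is claimed to exist in Nutch.

```python
from enum import Enum


class IndexSource(Enum):
    """Which text gets indexed for a URL that was discovered through a feed."""
    FEED_ONLY = "feed"
    PAGE_ONLY = "page"
    BOTH = "both"


def select_index_text(feed_text, page_text, source):
    """Pick the text to hand to the indexer according to the chosen option."""
    if source is IndexSource.FEED_ONLY:
        return feed_text
    if source is IndexSource.PAGE_ONLY:
        return page_text
    # BOTH: join the two parts, skipping whichever one is empty.
    return "\n".join(part for part in (feed_text, page_text) if part)


if __name__ == "__main__":
    feed = "Summary of the first post."
    page = "Full page text with navigation links and comments."
    for option in IndexSource:
        print(option.value, "->", repr(select_index_text(feed, page, option)))
```

Indexing the feed text alone avoids the false positives Doug mentions, at the
cost of missing anything that appears only on the referenced page.
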
>> Doug
>> 
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tp8722009p20815016.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
