Where can I find Scott's solution? I am trying to do it exactly like Scott, but I cannot work out how to index the items separately. Can anybody help me, please?
Many thanks
Miro


sdeck wrote:
>
> So, here is what I do for RSS feeds.
>
> I parse the RSS, and for each outlink, I create the outlink object and set
> the anchor text of that outlink to a well-formed XML string. It contains
> the pub date, description, etc. Now, this is only because I was hacking
> the outlink to just use its anchor text, but you could always just create
> a new MetaData object for use with an outlink. Then, the next time that
> URL is called up and you get an HTML parser, you could look at the
> outlink's metadata and say, hey, look, you came from an RSS feed. So, I
> can either just use your stored metadata and not parse the HTML, or I
> could combine your metadata with what comes from the HTML, etc.
> I have found that to be the best solution.
>
> Also, when I parse the RSS feed, I set a meta tag called "noindex", so in
> my basic indexer, if that is in there, I do not include the RSS feed page
> in the Lucene index.
>
> Scott
>
>
> Doug Cutting wrote:
>>
>> Chris Mattmann wrote:
>>> Got it. So, the logic behind this is, why bother waiting until the
>>> following fetch to parse (and create ParseData objects from) the RSS
>>> items out of the feed? Okay, I get it, assuming that the RSS feed has
>>> *all* of the RSS metadata in it. However, it's perfectly acceptable to
>>> have feeds that simply have a title, description, and link in them.
>>
>> Almost. The feed may have less than the referenced page, but it's also
>> a lot easier to parse, since the link could be an anchor within a large
>> page, or could be a page that has lots of navigation links, spam
>> comments, etc. So feed entries are generally much more precise than the
>> pages they reference, and may make for a higher-quality search
>> experience.
>>
>>> I guess this is still valuable metadata information to have. However,
>>> the caveats implied by the proposed change are:
>>>
>>> 1. We won't have cached copies, or fetched copies, of the Content
>>> represented by the item links. Therefore, in this model, we won't be
>>> able to pull up a Nutch cache of the page corresponding to the RSS
>>> item, because we are circumventing the fetch step.
>>
>> Good point. We indeed wouldn't have these URLs in the cache.
>>
>>> 2. It sounds like a pretty fundamental API shift in Nutch, to support a
>>> single type of content, RSS. Even if there are more content types that
>>> follow this model, as Doug and Renaud both pointed out, there aren't a
>>> multitude of them (perhaps archive files, but can you think of any
>>> others)?
>>
>> Also true. On the other hand, Nutch provides 98% of an RSS search
>> engine. It'd be a shame to have to re-invent everything else and it
>> would be great if Nutch could evolve to support RSS well.
>>
>> Could image search also benefit from this? One could generate a
>> Parse for each image on a page whose text was from the page. Product
>> search too, perhaps.
>>
>>> The other main thing that comes to mind about this for me is that it
>>> prevents the fetched Content for the RSS items from being able to
>>> provide useful metadata, in the sense that it doesn't explicitly fetch
>>> the content. What if we wanted to apply some super cool metadata
>>> extractor X that used word-stemming, HTML design analysis, and other
>>> techniques to extract metadata from the content pointed to by an RSS
>>> item link? In the proposed model, we assume that the RSS xml item tag
>>> already contains all necessary metadata for indexing, which in my mind
>>> limits the model. Does what I am saying make sense? I'm not shooting
>>> down the issue, I'm just trying to brainstorm a bit here about the
>>> issue.
>>
>> Sure, the RSS feed may contain less than the page it references, but
>> that might be all that one wishes to index. Otherwise, if, e.g., a blog
>> includes titles from other recent posts you're going to get lots of
>> false positives. Ideally Nutch should support various options:
>> searching the feed only, searching the referenced page only, or perhaps
>> searching both.
>>
>> Doug
>>
>>
>
>

-- 
View this message in context: http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tp8722009p20815016.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
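
For anyone trying to picture the approach Scott describes above (parse the feed, attach each item's fields to its own outlink, and flag the feed page itself as "noindex"), here is a rough sketch of the idea. It is not Scott's code and it deliberately does not use any Nutch classes: `RssOutlinkSketch`, `FeedOutlink`, `IndexSource`, `parseFeed`, `fieldsToIndex`, `shouldIndex` and the `fromRss` key are all made-up names for illustration, and only the JDK's built-in XML parser is used. In a real Nutch plugin the same data would have to be carried by whatever outlink/metadata mechanism your Nutch version offers.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RssOutlinkSketch {

    /** Hypothetical stand-in for an outlink that carries per-item metadata. */
    static final class FeedOutlink {
        final String url;
        final Map<String, String> metadata;

        FeedOutlink(String url, Map<String, String> metadata) {
            this.url = url;
            this.metadata = metadata;
        }
    }

    /** The three options Doug lists: feed only, referenced page only, or both. */
    enum IndexSource { FEED_ONLY, PAGE_ONLY, BOTH }

    /** Parses RSS 2.0 text and returns one outlink per <item> that has a <link>. */
    static List<FeedOutlink> parseFeed(String rssXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)));
            NodeList items = doc.getElementsByTagName("item");
            List<FeedOutlink> outlinks = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                String link = childText(item, "link");
                if (link == null) {
                    continue; // an item without a link cannot become an outlink
                }
                Map<String, String> meta = new LinkedHashMap<>();
                meta.put("fromRss", "true"); // lets a later HTML parse see where the URL came from
                for (String field : new String[] {"title", "description", "pubDate"}) {
                    String value = childText(item, field);
                    if (value != null) {
                        meta.put(field, value);
                    }
                }
                outlinks.add(new FeedOutlink(link, meta));
            }
            return outlinks;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed", e);
        }
    }

    /** Returns the trimmed text of the first direct child element named tag, or null. */
    static String childText(Element parent, String tag) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && tag.equals(n.getNodeName())) {
                return n.getTextContent().trim();
            }
        }
        return null;
    }

    /** Chooses which fields reach the index for one item URL. */
    static Map<String, String> fieldsToIndex(Map<String, String> feedMeta,
                                             Map<String, String> pageFields,
                                             IndexSource mode) {
        Map<String, String> out = new LinkedHashMap<>();
        if (mode != IndexSource.FEED_ONLY) {
            out.putAll(pageFields);
        }
        if (mode != IndexSource.PAGE_ONLY) {
            out.putAll(feedMeta); // assumption: feed values win when both define a field
        }
        return out;
    }

    /** A page flagged "noindex" (here: the feed page itself) is skipped by the indexer. */
    static boolean shouldIndex(Map<String, String> pageMetadata) {
        return !"true".equals(pageMetadata.get("noindex"));
    }

    public static void main(String[] args) {
        String rss = "<rss version=\"2.0\"><channel><title>Feed</title>"
                + "<item><title>A</title><link>http://example.com/a</link>"
                + "<description>First post</description></item>"
                + "</channel></rss>";
        for (FeedOutlink outlink : parseFeed(rss)) {
            System.out.println(outlink.url + " -> " + outlink.metadata);
        }
    }
}
```

The point of the sketch is only that each item becomes its own record keyed by its link, so it can be indexed separately from the feed page, while the feed page is kept out of the index by the "noindex" flag. Whether the referenced page is also fetched and merged in is the choice Doug describes at the end of his reply.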