Re: Indexing Feeds & Blog Posts with Nutch

2007-10-15 Thread Pike
Hi Chris > There are currently 2 plugins that parse feeds and get them indexed: > parse-rss - older, but gets the job done > feed - newer, and takes advantage of the ability to parse/index feeds in > one step, rather than in many [..] > Parse-rss indexes the whole feed, whereas the feed plugi

Re: Indexing Feeds & Blog Posts with Nutch

2007-10-15 Thread Chris Mattmann
Hi Brian, Sorry for taking so long to reply. Here ya go: > Do you have any URLs for feeds that are reliably parsed and indexed by > the feed parser? I haven't tested/used this plugin in quite a while. There was someone on the nutch-user list before, nutch.newbie, that was doing quite a bit

Re: Indexing Feeds & Blog Posts with Nutch

2007-10-15 Thread Chris Mattmann
Hi Pike, Parse-rss indexes the whole feed, whereas the feed plugin takes advantage of NUTCH-443, which allows Parsers to return multiple Parse objects, which indexes each item in the feed as its own record. HTH, Chris On 10/15/07 7:25 AM, "Pike" <[EMAIL PROTECTED]> wrote: > Hi > >>> I hav
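
To make the distinction concrete, here is a minimal sketch of the "one record per feed item" idea, written against the JDK's built-in DOM parser only. It does not use the Nutch Parser/Parse API or the actual feed plugin code; the class `FeedItemSplitter`, the `ItemRecord` type, and the sample feed are all hypothetical names made up for illustration, and it assumes a plain RSS 2.0 document with `item` elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only: this is NOT the Nutch feed plugin.
public class FeedItemSplitter {

    /** Hypothetical record type: one instance per feed item. */
    public static final class ItemRecord {
        public final String link;
        public final String title;

        public ItemRecord(String link, String title) {
            this.link = link;
            this.title = title;
        }
    }

    /** Returns one record per <item>, instead of one record for the whole feed. */
    public static List<ItemRecord> splitItems(String rssXml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(rssXml)));

        NodeList items = doc.getElementsByTagName("item");
        List<ItemRecord> records = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            records.add(new ItemRecord(childText(item, "link"),
                                       childText(item, "title")));
        }
        return records;
    }

    /** Text of the first descendant element with the given tag, or "" if absent. */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample feed with two items.
        String rss = "<rss version=\"2.0\"><channel>"
                + "<title>Example feed</title><link>http://example.org/</link>"
                + "<item><title>First post</title><link>http://example.org/1</link></item>"
                + "<item><title>Second post</title><link>http://example.org/2</link></item>"
                + "</channel></rss>";
        for (ItemRecord r : splitItems(rss)) {
            System.out.println(r.link + " -> " + r.title);
        }
    }
}
```

With a whole-feed approach the two posts above would end up in a single indexed record for the feed URL; with the per-item approach each post gets its own record keyed by its own link, which is what makes individual posts searchable.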

Re: Indexing Feeds & Blog Posts with Nutch

2007-10-15 Thread Pike
Hi >> I have this with all results: what is indexed >> seems to be 1 record per feed, containing a >> parsed version of the content including all its items, >> with sometimes bits of xml and html markup in it. >> >> I was assuming this is the intended behaviour ? > > It may well be the intended

Re: Indexing Feeds & Blog Posts with Nutch

2007-10-15 Thread Rick Moynihan
Pike wrote: Hi Ricky, Chris I've not noticed much difference, with both plugins failing on the feedburner feed: - http://feeds.feedburner.com/Techcrunch Strange, but that feed is indeed invalid xml if I wget it. It starts with newlines and ends with comments. Very picky, but that's not all

Re: Indexing Feeds & Blog Posts with Nutch

2007-10-12 Thread Pike
Hi Ricky, Chris > I've not noticed much > difference, with both plugins failing on the feedburner feed: > > - http://feeds.feedburner.com/Techcrunch > Strange, but that feed is indeed invalid xml if I wget it. It starts with newlines and ends with comments. Very picky, but that's not allowed af
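
For anyone hitting the same thing, a small sketch of the well-formedness issue, using only the JDK's DOM parser. The class `LenientFeedCheck` and its methods are hypothetical, and the inputs are made-up stand-ins, not the actual feedburner response. As far as the XML 1.0 spec goes, an XML declaration must be the very first thing in the document, so whitespace before `<?xml ...?>` is a fatal error; comments after the root element are permitted, so the leading newlines are the more likely culprit of the two.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only: not part of Nutch or either feed plugin.
public class LenientFeedCheck {

    /** True if the JDK's XML parser accepts the text as well-formed XML. */
    public static boolean isWellFormed(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    /** Drops a byte-order mark and any whitespace before the first markup. */
    public static String stripLeadingWhitespace(String xml) {
        return xml.replaceFirst("^[\\uFEFF\\s]+", "");
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a feed that starts with blank lines and ends with a comment.
        String raw = "\n\n<?xml version=\"1.0\"?><rss version=\"2.0\"><channel/></rss>"
                + "<!-- trailing comment -->";
        System.out.println("raw:      " + isWellFormed(raw));
        System.out.println("stripped: " + isWellFormed(stripLeadingWhitespace(raw)));
    }
}
```

Stripping leading whitespace before handing the bytes to a parser is a workaround, not a fix for the feed itself, and whether either Nutch plugin does anything like this is not something this sketch claims.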

Re: Indexing Feeds & Blog Posts with Nutch

2007-10-12 Thread Rick Moynihan
Chris Mattmann wrote: There are currently 2 plugins that parse feeds and get them indexed: parse-rss - older, but gets the job done feed - newer, and takes advantage of the ability to parse/index feeds in one step, rather than in many I didn't realise this as I was using 0.9 where only pars

Re: Indexing Feeds & Blog Posts with Nutch

2007-10-11 Thread Brian Ulicny
Chris, Recently, I've been playing around with the feed plugin from the nightly build but unsuccessfully. I can't get any indexed fields from feeds in the wild. Do you have any URLs for feeds that are reliably parsed and indexed by the feed parser? Does it actually index atom at present? There

Re: Indexing Feeds & Blog Posts with Nutch

2007-10-11 Thread Chris Mattmann
Hi Rick, Glad to hear that you're interested in using Nutch! There are currently 2 plugins that parse feeds and get them indexed: parse-rss - older, but gets the job done feed - newer, and takes advantage of the ability to parse/index feeds in one step, rather than in many There are other

Indexing Feeds & Blog Posts with Nutch

2007-10-11 Thread Rick Moynihan
Hi all, I've recently downloaded Nutch v0.9, to experiment in searching blog posts and RSS/Atom feeds. So far I have managed to get it to successfully crawl, index and search some websites. I am now starting my investigations to use Nutch to crawl/index/search news/blog feeds. And have inc