> Actually, this isn't entirely the case. parse-rss also indexes the item text (see line 148 in RSSParser.java). Additionally, parse-rss adds the individual item links to the Outlinks (see lines 161 and 163 in RSSParser.java), and they get crawled as well, in addition to the channel text (see line 123 in RSSParser.java) and channel outlink (see lines 130 and 132 in RSSParser.java).

Yep, maybe I wasn't clear enough. Sorry Chris ;)

RSSParser does read the items and lets you index their concatenated text. But the items are not returned individually, so they can't be indexed individually right away. parse-rss does return all the item links, though, so if you decide to fetch and parse each item "link", you can then extract the item text or do other parsing for each individual item page. Sorry if I confused some people.

Personally, I am focusing only on RSS, and I try to index as much as I can directly from the feed, to avoid having to extract the item text from the full HTML page. Of course, I am then limited to whatever the feed contains.

> I haven't really noticed any formats not really handled by commons-feedparser. What formats have you noticed that it doesn't handle?

I think I had problems with Atom <content> from feeds like this one:
http://meetvinz.blogspot.com/atom.xml
and with RSS <content:encoded>, for instance from
http://feeds.feedburner.com/TechCrunch

Was it my mistake? If it was, I'd love to go back to feedparser, as it is apparently faster than ROME. ;)

> -----Original Message-----
> From: Dima Gritsenko [mailto:[EMAIL PROTECTED]
> Sent: Monday, August 28, 2006 10:44 AM
> To: [email protected]
> Subject: RSS search by nutch
>
> Hi,
>
> Does nutch have a class for searching incoming RSS feeds in real time?
> Thank you.
> Dima.
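
To make the "index each item straight from the feed" idea above more concrete, here is a minimal sketch of what I mean. It is not the parse-rss, commons-feedparser, or ROME API; it only uses the DOM parser that ships with the JDK. The names FeedItemSketch, Item, and parseItems are made up for illustration, and the sample feed in main is invented. It returns each RSS item separately and prefers <content:encoded> over <description> when both are present.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: returns RSS items one by one.
public class FeedItemSketch {

    // Namespace of the RSS "content" module that defines <content:encoded>.
    static final String CONTENT_NS = "http://purl.org/rss/1.0/modules/content/";

    /** One feed entry, kept separate so it could be indexed on its own. */
    static final class Item {
        final String title;
        final String link;
        final String text;

        Item(String title, String link, String text) {
            this.title = title;
            this.link = link;
            this.text = text;
        }
    }

    /** Parses an RSS 2.0 document and returns its items individually. */
    static List<Item> parseItems(String rssXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Needed so that <content:encoded> can be looked up by namespace.
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(rssXml)));

        List<Item> items = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName("item");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            String encoded = firstText(e.getElementsByTagNameNS(CONTENT_NS, "encoded"));
            String description = firstText(e.getElementsByTagName("description"));
            // Prefer the full body when the feed provides it.
            String text = encoded.isEmpty() ? description : encoded;
            items.add(new Item(
                    firstText(e.getElementsByTagName("title")),
                    firstText(e.getElementsByTagName("link")),
                    text));
        }
        return items;
    }

    // Text of the first node in the list, or "" when the list is empty.
    private static String firstText(NodeList list) {
        return list.getLength() == 0 ? "" : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Invented sample feed, only for demonstration.
        String rss =
            "<rss version=\"2.0\" xmlns:content=\"" + CONTENT_NS + "\">"
          + "<channel><title>Sample</title>"
          + "<item><title>First</title><link>http://example.com/1</link>"
          + "<description>Summary</description>"
          + "<content:encoded><![CDATA[<p>Full body</p>]]></content:encoded></item>"
          + "</channel></rss>";
        for (Item item : parseItems(rss)) {
            System.out.println(item.link + " | " + item.title + " | " + item.text);
        }
    }
}
```

Each returned Item could then be handed to the indexer as its own document, keyed by its link, instead of being folded into one concatenated text for the whole channel. Atom <content> would need the same kind of lookup under the Atom namespace, which this sketch leaves out.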
