On Mon, Jan 5, 2009 at 12:32 PM, Doğacan Güney <doga...@gmail.com> wrote:
> On Mon, Jan 5, 2009 at 7:00 AM, Vlad Cananau <vlad...@gmail.com> wrote:
>> Hello
>> I'm trying to make RSSParser do something similar to FeedParser (which
>> doesn't work quite right) - that is, instead of indexing the whole contents
>
> Why doesn't FeedParser work? Let's fix whatever is broken in it :D
>
>> of the feed, I want it to show individual items, with their respective title
>> and proper link to the article. I realize that I could index 1 depth
>> more, but I'd like to index just the feed, not the articles that go with it
>> (keep the index small and the crawl fast).
>>
>> For each item in each RSS channel (the code does not differ much for
>> getParse() of RSSParser.java) I do something like
>>
>> Outlink[] outlinks = new Outlink[1];
>> try {
>>     outlinks[0] = new Outlink(whichLink, theRSSItem.getTitle());
>> } catch (Exception e) {
>>     continue;
>> }
>>
>> parseResult.put(
>>     whichLink,
>>     new ParseText(theRSSItem.getTitle() + theRSSItem.getDescription()),
>>     new ParseData(
>>         ParseStatus.STATUS_SUCCESS,
>>         theRSSItem.getTitle(),
>>         outlinks,
>>         new Metadata() // was content.getMetadata()
>>     )
>> );
>>
>> The problem is, however, that only one item from the whole RSS gets into the
>> index, although in the log I can see them all (I've tried it with feeds
>> from cnn and reuters). What happens? Why do they get overwritten in a
>> seemingly random order? The item that makes it into the index is neither the
>> first nor the last, but appears to be the same until new items appear in the
>> feed.
>>
>> Thank you,
>> Vlad
>
> --
> Doğacan Güney

When using FeedParser, not everything from the feeds makes it into the index. For example, I crawl both Entertainment and Politics, but I get results for only some of the articles. Is there any way to check whether or not entries make it into the index? In the log I see "Indexing http://rss.cnn.com/... with analyzer org.apache.nutch.analyzer.NutchDocumentAnalyzer (something)" (I can't crawl right now, since I don't have access to the machine). But when I search for keywords specific to some of the documents, I don't get any results :-(
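
As a first check, the URLs that the indexer reports could be pulled out of the log and compared against the feed items. Below is a minimal sketch in plain Java, assuming the log lines really have the form "Indexing <url> with analyzer ..." as quoted above. The class name IndexedUrlLog and the method indexedUrls are made up for illustration and are not part of Nutch. It only shows what the indexer logged, not what is still in the index after any later step.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collects the URLs from log lines
// that look like "Indexing <url> with analyzer ...".
class IndexedUrlLog {

    // Assumed log format, based only on the line quoted in this message.
    private static final Pattern INDEXING_LINE =
            Pattern.compile("Indexing\\s+(\\S+)\\s+with analyzer");

    // Returns the distinct URLs in the order they first appear in the log.
    static Set<String> indexedUrls(List<String> logLines) {
        Set<String> urls = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = INDEXING_LINE.matcher(line);
            if (m.find()) {
                urls.add(m.group(1));
            }
        }
        return urls;
    }

    // Usage: java IndexedUrlLog <path-to-log-file>
    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java IndexedUrlLog <log-file>");
            return;
        }
        for (String url : indexedUrls(Files.readAllLines(Paths.get(args[0])))) {
            System.out.println(url);
        }
    }
}
```

If a feed item's URL shows up in this list but a keyword search for it returns nothing, the entry was handed to the indexer and was lost (or became unsearchable) somewhere after that point.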