I've been twiddling with the new feed-parsing code in nutch-1.0-dev as a
model for some changes to an existing Nutch architecture that I've been
running for a while. Using the new ParseResult structure I'm able to create
a set of ParseData objects for each document I've crawled. All of the
documents I'm crawling are RSS feeds, and I'd like to index each item in the
feeds as its own document. So everything is working just fine up until the
point where I try to create the index. At that point nothing happens. A
quick check through the Indexer code and a dump file of the segment shows me
that I don't have any CrawlDatum entries with a DB status for the items
parsed from each feed.

I never intended to crawl the items from the feeds themselves, so my question
is: can I add the DB status to the ParseData for each individual item during
the parsing/fetching stage? I'd rather not mess around in the Indexer and
remove checks for things that should probably exist.

patrik
