[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476600 ]
Andrzej Bialecki commented on NUTCH-443:
-----------------------------------------

Almost there ... ParseResult seemed to tidy up this patch quite a bit. Remaining issues:

* you create the "fake" CrawlDatum-s in ParseOutputFormat, and then set fetchTime to the current time. This is incorrect - parsing may have been performed long after the content was fetched. The correct place to create and store these "fake" CrawlDatum-s is in the FetcherThread.output(), where you loop through Entry<Text, Parse>, i.e.:

    long curTime = System.currentTimeMillis();
    for (Entry<Text, Parse> entry : parseResult) {
      Text k = entry.getKey();
      output.collect(k, new ObjectWritable(new ParseImpl(entry.getValue())));
      if (!k.equals(key)) {
        CrawlDatum fake = datum.clone();
        fake.set
        fake.setFetchTime(curTime);
        output.collect(k, new ObjectWritable(fake));
      } else {
        // save the real datum
        output.collect(k, new ObjectWritable(datum));
      }
    }

* I'm pretty sure that ParseResult.filter() must NOT be called under normal circumstances ... We need to store the information that parsing was unsuccessful - if we remove this information from the ParseResult we will never know that parsing failed for this content (or a part thereof).

* we have a backward-compatibility issue with ParseImpl.isFetched - i.e. data created with earlier versions of Nutch won't be compatible with the new format, and there is no versioning information in the already existing data. We need to do one of the following:

  - bite the bullet, and don't care about backward compatibility - not so nice ... all existing segments will have to be re-parsed. Ouch.

  - add look-ahead code to test the data coming from DataInput if it contains this boolean flag or a likely Text length - somewhat unreliable...

  - store this flag in ParseData.contentMeta - ugly hack.

  Out of these three the last option seems the safest for now. From the long-term point of view we should later on add versioning information and handling of different versions in Parse.
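
To illustrate the long-term versioning idea, here is a minimal sketch using only java.io. It assumes a leading version byte and a simplified record layout; the class VersionedParseRecord, its fields, and the version constants are hypothetical stand-ins, not the actual Parse/ParseImpl serialization. Note that it only helps for data written once a version byte exists; it does not solve the problem of already existing segments, which carry no version information at all.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a versioned Parse record; NOT the actual Nutch class.
class VersionedParseRecord {
  static final byte VERSION_NO_FLAG = 1; // assumed older layout: version byte + text
  static final byte CUR_VERSION = 2;     // assumed newer layout: version byte + text + flag

  String text = "";
  boolean canonical = true; // hypothetical name for the flag discussed as isFetched

  // Always write the newest layout, led by its version byte.
  void write(DataOutput out) throws IOException {
    out.writeByte(CUR_VERSION);
    out.writeUTF(text);
    out.writeBoolean(canonical);
  }

  // Read any known layout; records without the flag default to canonical = true.
  void readFields(DataInput in) throws IOException {
    byte version = in.readByte();
    if (version < VERSION_NO_FLAG || version > CUR_VERSION) {
      throw new IOException("Unsupported record version: " + version);
    }
    text = in.readUTF();
    canonical = (version >= CUR_VERSION) ? in.readBoolean() : true;
  }

  // Helper: serialize a record with the current layout.
  static byte[] encode(VersionedParseRecord rec) {
    try {
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      rec.write(new DataOutputStream(bos));
      return bos.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Helper: emulate a record written with the older, flag-less layout.
  static byte[] encodeOldLayout(String text) {
    try {
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bos);
      out.writeByte(VERSION_NO_FLAG);
      out.writeUTF(text);
      return bos.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Helper: deserialize a record written with any known layout.
  static VersionedParseRecord decode(byte[] bytes) {
    try {
      VersionedParseRecord rec = new VersionedParseRecord();
      rec.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
      return rec;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With a layout like this, a later format change only needs a new version constant and one more branch in readFields(), instead of look-ahead guessing on the raw DataInput.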

* the name of this method Parse.isFetched is somewhat misleading - it's not about fetching or not, it's whether this Parse corresponds to the original url or to a sub-url. Perhaps isCanonical, isRoot, or some other name ...?

* in ParseSegment - what's the reason for creating a new copy of ParseImpl in this line below? I think we should store the one we already have in "parse":

    output.collect(url, new ParseImpl(new ParseText(parse.getText()), parse.getData(), parse.isFetched()));

Thank you for your perseverance!

> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-443
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, NUTCH-443.022507.patch.txt, NUTCH-443.02282007.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
>
>
> allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can return multiple parse objects, that will all be indexed separately. Advantage: no need to fetch all feed-items separately.
> see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.