[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476297 ]
Andrzej Bialecki commented on NUTCH-443:
-----------------------------------------

Overall the idea of this improvement looks very useful, but I'm -1 on this patch as it stands (only now have I had a chance to review it in detail). I'd like to see the following issues addressed before it's committed:

* In my opinion it's easier to add the missing CrawlDatums (with a correctly set fetch time) for the new URLs to the output than to work around their absence by passing the fetch time in metadata and then compensating in Indexer and CrawlDbReducer for the missing fetchDatums.
* In Fetcher / Fetcher2 you don't pass the signature when there is no valid Parse output, but in the current versions of the Fetchers the signature is still calculated and passed in datum.setSignature() (which ends up in crawl_fetch).
* Using a generic Map<String, Parse> is IMHO inappropriate, as I indicated earlier, especially since this Map requires special post-processing in ParseUtil.processParseMap - and what would happen if I didn't use ParseUtil? I think this calls for a special-purpose class (ParseResult?) that would encapsulate this behavior without exposing it to its users (or, even worse, allowing users to bypass it). Such a class would also let us avoid the somewhat ugly "convenience" methods in ParseStatus and ParseImpl - those details would be hidden in one of the constructors of ParseResult.
* I'm also not sure why we use Map<String, Parse> and not Map<Text, Parse>, since in all further processing we need to create Text objects anyway.
* The new section in HtmlParseFilters breaks the loop on encountering the first error and leaves the parse results incompletely filtered. It should simply continue - the result is an aggregation of more or less independent documents that are parsed on their own.
* The comment about redirects in Parser.java is misplaced - I think this contract should be both defined and enforced in the Fetcher.
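The ParseResult idea suggested above could be sketched roughly as follows. This is purely illustrative - the class shape, method names, and the minimal stand-in Parse type are assumptions for the sketch, not Nutch's actual API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal stand-in for Nutch's Parse so the sketch is self-contained.
class Parse {
    final String text;
    Parse(String text) { this.text = text; }
}

// Hypothetical special-purpose ParseResult: it owns the url -> Parse
// mapping, so any aggregation/post-processing logic lives inside the
// class and callers can neither skip it nor bypass it.
public class ParseResult {
    private final String originalUrl;
    private final Map<String, Parse> parses = new LinkedHashMap<>();

    ParseResult(String originalUrl) { this.originalUrl = originalUrl; }

    // Convenience constructor for the common single-document case,
    // replacing the "convenience" methods in ParseStatus / ParseImpl.
    static ParseResult of(String url, Parse parse) {
        ParseResult r = new ParseResult(url);
        r.put(url, parse);
        return r;
    }

    void put(String url, Parse parse) { parses.put(url, parse); }

    Parse get(String url) { return parses.get(url); }

    int size() { return parses.size(); }

    public static void main(String[] args) {
        // One fetched RSS feed yielding several independent sub-documents.
        ParseResult result = new ParseResult("http://example.com/feed.rss");
        result.put("http://example.com/item1", new Parse("first item"));
        result.put("http://example.com/item2", new Parse("second item"));
        System.out.println(result.size()); // prints 2
    }
}
```

The point of the wrapper is that no bare Map ever escapes, so there is no way to forget the post-processing step that ParseUtil.processParseMap performs today.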
And finally, I think this is a significant change in the way content parsers work with the rest of the framework, so we should hold this patch until after the 0.9 release - and we should push 0.9 out the door really soon ...

> allow parsers to return multiple Parse objects; this will speed up the rss parser
> ----------------------------------------------------------------------------------
>
> Key: NUTCH-443
> URL: https://issues.apache.org/jira/browse/NUTCH-443
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher
> Affects Versions: 0.9.0
> Reporter: Renaud Richardet
> Assigned To: Chris A. Mattmann
> Priority: Minor
> Fix For: 0.9.0
>
> Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, NUTCH-443.022507.patch.txt, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
>
> allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can return multiple Parse objects that will all be indexed separately. Advantage: no need to fetch all feed items separately.
> see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
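For readers unfamiliar with the feature under review, the contract proposed in the quoted issue (Parser#parse returning a Map<String,Parse>) can be sketched as follows. The class and parameter names here are hypothetical, and the real parser would extract item URLs and text from the fetched feed content rather than take them as arguments:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal stand-in for Nutch's Parse so the sketch compiles on its own.
class Parse {
    final String text;
    Parse(String text) { this.text = text; }
}

// Hypothetical shape of the proposed contract: one fetch of an RSS feed
// yields one Parse per feed item, keyed by the item's URL, so each item
// can be indexed separately without being fetched on its own.
public class RssParserSketch {
    Map<String, Parse> parse(String feedUrl, String[] itemUrls, String[] itemTexts) {
        Map<String, Parse> parses = new LinkedHashMap<>();
        for (int i = 0; i < itemUrls.length; i++) {
            parses.put(itemUrls[i], new Parse(itemTexts[i]));
        }
        return parses;
    }

    public static void main(String[] args) {
        Map<String, Parse> result = new RssParserSketch().parse(
            "http://example.com/feed.rss",
            new String[] { "http://example.com/a", "http://example.com/b" },
            new String[] { "item a", "item b" });
        System.out.println(result.size()); // prints 2
    }
}
```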