[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476297
 ] 

Andrzej Bialecki  commented on NUTCH-443:
-----------------------------------------

Overall the idea of this improvement looks very useful, but I'm -1 on this 
patch as it looks right now (only now I had a chance to review it in detail). 
I'd like to see the following issues addressed in this patch before it's 
committed:

* in my opinion it's easier to add missing CrawlDatum's (with correctly set 
fetch time) for the new urls to the output rather than work-around this by 
passing around the fetch time in metadata, and then again compensating in 
Indexer and CrawlDbReducer for the lack of these fetchDatum-s ..

* in Fetcher / Fetcher2 you don't pass the signature in case when there is no 
valid Parse output, but in the current versions of Fetchers the signature is 
still calculated and passed in datum.setSignature() (which ends up in 
crawl_fetch).

* using a generic Map<String, Parse> is IMHO inappropriate, as I indicated 
earlier, especially since this Map requires special post-processing in 
ParseUtil.processParseMap - and what would happen if I didn't use ParseUtil? I 
think this calls for a special-purpose class (ParseResult?), which would 
encapsulate this behavior without exposing it to its users (or even worse - 
allowing users to bypass it). This class would also help us to avoid somewhat 
ugly "convenience" methods in ParseStatus and ParseImpl - these details would 
be hidden in one of the constructors of ParseResult.

* I'm also not sure why we use Map<String, Parse> and not Map<Text, Parse>, 
since in all further processing we need to create Text objects ...

* the new section in HtmlParseFilters breaks the loop on encountering the first 
error, and leaves the parse results incompletely filtered. It should simply 
continue - the result is an aggregation of more or less independent documents 
that are parsed on their own.

* the comment about redirects in Parser.java is misplaced - I think this 
contract should be both defined and enforced in the Fetcher.


And finally, I think this is a significant change in the way how content 
parsers work with the rest of the framework, so we should wait with this patch 
after the 0.9 release - and we should push 0.9 out of the door really soon ...

> allow parsers to return multiple Parse object, this will speed up the rss 
> parser
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-443
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
> NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
> NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, 
> NUTCH-443.022507.patch.txt, parse-map-core-draft-v1.patch, 
> parse-map-core-untested.patch, parsers.diff
>
>
> allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser 
> can return multiple parse objects, that will all be indexed separately. 
> Advantage: no need to fetch all feed-items separately.
> see the discussion at 
> http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to