[
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476297
]
Andrzej Bialecki commented on NUTCH-443:
-----------------------------------------
Overall the idea of this improvement looks very useful, but I'm -1 on this
patch as it looks right now (only now I had a chance to review it in detail).
I'd like to see the following issues addressed in this patch before it's
committed:
* in my opinion it's easier to add missing CrawlDatum's (with correctly set
fetch time) for the new urls to the output rather than work-around this by
passing around the fetch time in metadata, and then again compensating in
Indexer and CrawlDbReducer for the lack of these fetchDatum-s ..
* in Fetcher / Fetcher2 you don't pass the signature in case when there is no
valid Parse output, but in the current versions of Fetchers the signature is
still calculated and passed in datum.setSignature() (which ends up in
crawl_fetch).
* using a generic Map<String, Parse> is IMHO inappropriate, as I indicated
earlier, especially since this Map requires special post-processing in
ParseUtil.processParseMap - and what would happen if I didn't use ParseUtil? I
think this calls for a special-purpose class (ParseResult?), which would
encapsulate this behavior without exposing it to its users (or even worse -
allowing users to bypass it). This class would also help us to avoid somewhat
ugly "convenience" methods in ParseStatus and ParseImpl - these details would
be hidden in one of the constructors of ParseResult.
* I'm also not sure why we use Map<String, Parse> and not Map<Text, Parse>,
since in all further processing we need to create Text objects ...
* the new section in HtmlParseFilters breaks the loop on encountering the first
error, and leaves the parse results incompletely filtered. It should simply
continue - the result is an aggregation of more or less independent documents
that are parsed on their own.
* the comment about redirects in Parser.java is misplaced - I think this
contract should be both defined and enforced in the Fetcher.
And finally, I think this is a significant change in the way how content
parsers work with the rest of the framework, so we should wait with this patch
after the 0.9 release - and we should push 0.9 out of the door really soon ...
> allow parsers to return multiple Parse object, this will speed up the rss
> parser
> --------------------------------------------------------------------------------
>
> Key: NUTCH-443
> URL: https://issues.apache.org/jira/browse/NUTCH-443
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher
> Affects Versions: 0.9.0
> Reporter: Renaud Richardet
> Assigned To: Chris A. Mattmann
> Priority: Minor
> Fix For: 0.9.0
>
> Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch,
> NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch,
> NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch,
> NUTCH-443.022507.patch.txt, parse-map-core-draft-v1.patch,
> parse-map-core-untested.patch, parsers.diff
>
>
> allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser
> can return multiple parse objects, that will all be indexed separately.
> Advantage: no need to fetch all feed-items separately.
> see the discussion at
> http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers