[
https://issues.apache.org/jira/browse/NUTCH-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16688081#comment-16688081
]
Sebastian Nagel commented on NUTCH-2675:
----------------------------------------
The parse job has no access to the CrawlDb and cannot read or modify the
CrawlDatum objects it contains. This follows from Nutch's architecture, which
is based on MapReduce. Adding the CrawlDb as input and output of the parsing
job would have a heavy impact on the job's performance, as the CrawlDb is
usually much larger than the segment to be parsed. The MapReduce job
architecture allows crawls to scale up to billions of pages, but imposes a
couple of limitations on the programmer. I have no idea how we could pass the
CrawlDb's CrawlDatum to the parser without more or less rewriting Nutch from
scratch.
> Give parsers the capability to read and write CrawlDatum
> --------------------------------------------------------
>
> Key: NUTCH-2675
> URL: https://issues.apache.org/jira/browse/NUTCH-2675
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.15
> Reporter: Junqiang Zhang
> Priority: Minor
> Fix For: 1.15
>
>
> Parsers are called inside org.apache.nutch.parse.ParseSegment
> (line 127 in version 1.15):
>     parseResult = parseUtil.parse(content);
> and inside org.apache.nutch.fetcher.FetcherThread
> (line 640 in version 1.15):
>     parseResult = this.parseUtil.parse(content);
> The current version of Nutch does not give parsers the capability to access
> CrawlDatum. If users want to customize the parsing process using some
> metadata of CrawlDatum, it is difficult to read the required metadata.
> On the other hand, if users want to save metadata generated during parsing,
> the metadata can only be saved as parseMeta of
> org.apache.nutch.parse.ParseData. Only those parseMeta entries selected by
> db.parsemeta.to.crawldb in nutch-site.xml are then added to CrawlDatum, inside
> org.apache.nutch.parse.ParseOutputFormat and
> org.apache.nutch.crawl.CrawlDbReducer. If parsers had direct access to
> CrawlDatum, the metadata generated during parsing could be added to CrawlDatum
> directly by parsers.
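
For illustration, the selection step described above can be sketched with plain Java maps. This is a minimal sketch, not Nutch's actual code: the class and method names are hypothetical, and java.util.Map stands in for both the parseMeta and the CrawlDatum metadata. Only the property name db.parsemeta.to.crawldb comes from the description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in; not a Nutch class.
class ParseMetaSelection {

    /**
     * Returns the parseMeta entries whose names appear in the comma-separated
     * property value (as one might set for db.parsemeta.to.crawldb).
     * Names that are absent from parseMeta are skipped.
     */
    static Map<String, String> selectForCrawlDb(Map<String, String> parseMeta,
                                                String propertyValue) {
        Map<String, String> selected = new LinkedHashMap<>();
        for (String raw : propertyValue.split(",")) {
            String name = raw.trim();
            if (name.isEmpty()) {
                continue; // empty property or stray comma
            }
            String value = parseMeta.get(name);
            if (value != null) {
                selected.put(name, value);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new LinkedHashMap<>();
        parseMeta.put("lang", "en");
        parseMeta.put("title", "Example");
        // Only "lang" is listed, so only "lang" would travel on to the CrawlDatum.
        System.out.println(selectForCrawlDb(parseMeta, "lang"));
    }
}
```

Anything a parser writes to parseMeta under a name that is not listed in the property stays in the segment only.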
> I use Nutch to fetch and parse web pages. To read required metadata from
> CrawlDatum during parsing, I use the following workaround.
> (1) During web page fetching, inside
> org.apache.nutch.protocol.http.api.HttpBase of the lib-http plugin, read the
> required metadata from CrawlDatum, and save it together with the Headers
> metadata of org.apache.nutch.net.protocols.Response to the metadata of
> org.apache.nutch.protocol.Content. This can be done at line 334 of the code
> by replacing "response.getHeaders()" with a new metadata object containing
> both the required metadata from CrawlDatum and the Headers metadata.
> The code that needs to be modified inside
> org.apache.nutch.protocol.http.api.HttpBase of the lib-http plugin is
> (line numbers are for version 1.15):
>     Content c = new Content(u.toString(), u.toString(),   // line 332
>         (content == null ? EMPTY_CONTENT : content),      // line 333
>         response.getHeader("Content-Type"),               // line 334
>         response.getHeaders(), mimeTypes);                // line 334
> (2) During HTML page parsing, inside org.apache.nutch.parse.html.HtmlParser
> of the parse-html plugin, read the required metadata from the metadata of
> org.apache.nutch.protocol.Content, and customize the parsing process using
> the required metadata.
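
The two workaround steps above can be sketched roughly as follows. This is a sketch under stated assumptions, not the reporter's patch: plain java.util.Map objects stand in for the response headers, the CrawlDatum metadata and the Content metadata, and the class name, method names and the "crawldatum." key prefix are all hypothetical (the prefix is added here only to avoid collisions with HTTP header names).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in; not a Nutch class.
class MergedContentMetadata {

    // Assumed prefix that keeps CrawlDatum keys apart from HTTP header names.
    static final String DATUM_PREFIX = "crawldatum.";

    /**
     * Step (1), fetch side: build the metadata that would be passed to the
     * Content constructor in place of response.getHeaders().
     */
    static Map<String, String> merge(Map<String, String> responseHeaders,
                                     Map<String, String> datumMeta,
                                     String... wantedKeys) {
        Map<String, String> merged = new LinkedHashMap<>(responseHeaders);
        for (String key : wantedKeys) {
            String value = datumMeta.get(key);
            if (value != null) {
                merged.put(DATUM_PREFIX + key, value);
            }
        }
        return merged;
    }

    /**
     * Step (2), parse side: read a value that originated in the CrawlDatum
     * back out of the content metadata. Returns null if it was not copied.
     */
    static String readDatumValue(Map<String, String> contentMeta, String key) {
        return contentMeta.get(DATUM_PREFIX + key);
    }
}
```

Note that this only carries data one way, from the CrawlDatum to the parser; it gives the parser no way to write back.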
> If parsers had direct access to CrawlDatum, the above workaround would not be
> needed. To give parsers the capability to directly read and write CrawlDatum,
> I would like to suggest adding a new method "public ParseResult parse(Content
> content, CrawlDatum datum)" to org.apache.nutch.parse.ParseUtil in future
> versions of Nutch.
> To be compatible with the current 1.15 and previous versions, I would like to
> suggest adding a new configuration property to nutch-default.xml. By default,
> the property would select the current method "public ParseResult
> parse(Content content)". If users want to use "public ParseResult
> parse(Content content, CrawlDatum datum)", they can change the property in
> nutch-site.xml.
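
The shape of the proposed overload and its configuration switch might look like the sketch below. Everything here is an assumption for illustration: the nested Content, CrawlDatum and ParseResult classes are minimal stand-ins rather than the Nutch types, the property name "parser.parse.with.crawldatum" is hypothetical (the proposal does not name one), and so are the metadata keys. The sketch does not address how changes to the datum would reach the CrawlDb, which the comment above identifies as the hard part.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for ParseUtil; not the Nutch class.
class ParseUtilSketch {

    // Hypothetical property name; the proposal does not specify one.
    static final String USE_DATUM_PROPERTY = "parser.parse.with.crawldatum";

    // Minimal stand-ins for the Nutch types of the same names.
    static class Content {
        final String url;
        Content(String url) { this.url = url; }
    }

    static class CrawlDatum {
        final Map<String, String> metaData = new HashMap<>();
    }

    static class ParseResult {
        final String url;
        final Map<String, String> parseMeta = new HashMap<>();
        ParseResult(String url) { this.url = url; }
    }

    private final boolean useDatum;

    ParseUtilSketch(Map<String, String> conf) {
        // Defaults to false so existing behaviour is kept unless the user opts in.
        this.useDatum = Boolean.parseBoolean(
                conf.getOrDefault(USE_DATUM_PROPERTY, "false"));
    }

    /** Current-style entry point: the parser sees only the content. */
    ParseResult parse(Content content) {
        return new ParseResult(content.url);
    }

    /** Proposed overload: the parser may also read and write the datum. */
    ParseResult parse(Content content, CrawlDatum datum) {
        ParseResult result = parse(content);
        String hint = datum.metaData.get("parse.hint"); // read (hypothetical key)
        if (hint != null) {
            result.parseMeta.put("hint", hint);
        }
        datum.metaData.put("parsed", "true");           // write (hypothetical key)
        return result;
    }

    /** What a caller might do: choose the method according to the property. */
    ParseResult run(Content content, CrawlDatum datum) {
        return useDatum ? parse(content, datum) : parse(content);
    }
}
```

With the property unset, run() behaves exactly like the single-argument parse(), which is the backward-compatible default the description asks for.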