Hi again,

Dennis Kubes wrote:

> Nutch gets outlinks from the pages it parses.  This is either during the
> fetch process with parsing enabled or during just a parse process (see
> org.apache.nutch.parse.ParseSegment).  The content is parsed via plugins
> configured in parse-plugins.xml in the conf directory.  During the parse
> links are created as Outlink objects that are added to a ParseData
> object that is itself added to a Parse object.  During the writing out
> of the parse object (ParseOutputFormat) the outlinks are saved as
> CrawlDatums in the crawl_parse directory under the segment.  Then during
> the UpdateDb job (see CrawlDb) this crawl_parse is merged into the
> master Crawl Database.  That is the long answer.
> 
> The short answer: when you parse, get the Outlinks and add them to the
> ParseData -> Parse object. They will then be merged automatically into
> the CrawlDb when the UpdateDb job is run, and they will be fetched when
> the next Fetch job is run.
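
If I read that correctly, the flow is roughly what the sketch below models. It is only a toy illustration of the merge step: the class, the method names and the "fetched"/"unfetched" strings are my own stand-ins, not Nutch's CrawlDb or CrawlDatum API, and the real jobs run as MapReduce over segment directories.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical names, not the actual Nutch classes.
public class OutlinkFlowSketch {

    // Stand-in for the UpdateDb job: merge URLs discovered during parsing
    // (the role crawl_parse plays) into the master database.
    // URLs already known keep their status; new ones start as "unfetched".
    static Map<String, String> updateDb(Map<String, String> crawlDb,
                                        List<String> crawlParse) {
        Map<String, String> merged = new LinkedHashMap<String, String>(crawlDb);
        for (String url : crawlParse) {
            if (!merged.containsKey(url)) {
                merged.put(url, "unfetched");
            }
        }
        return merged;
    }

    // Stand-in for building the next fetch list: everything still unfetched.
    static List<String> nextFetchList(Map<String, String> crawlDb) {
        List<String> urls = new ArrayList<String>();
        for (Map.Entry<String, String> e : crawlDb.entrySet()) {
            if ("unfetched".equals(e.getValue())) {
                urls.add(e.getKey());
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        Map<String, String> db = new LinkedHashMap<String, String>();
        db.put("http://example.com/", "fetched");

        // Outlinks found while parsing the seed page.
        List<String> parsed = Arrays.asList(
            "http://example.com/", "http://example.com/page2");

        Map<String, String> merged = updateDb(db, parsed);
        System.out.println(nextFetchList(merged)); // prints [http://example.com/page2]
    }
}
```

So anything that ends up among the parsed outlinks is picked up on the next round, which is why I'd like to control what goes into that list.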

I was attempting to do this from an HtmlParseFilter plugin, at which
point the data is already parsed and the Outlinks have already been
created.  I thought there might be a way to modify the Outlinks at this
point, but I haven't found one.

It looks like the work that I'm interested in is being done in
DOMContentUtils.getOutlinks, the relevant bit of code from HtmlParser being:

      utils.getOutlinks(baseTag!=null?baseTag:base, l, root);
      outlinks = (Outlink[])l.toArray(new Outlink[l.size()]);

Soon after, the outlinks are assigned to the ParseData object, and
there's no method to modify that array.
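
To make it concrete, this is the kind of thing I'd like to be able to do. The classes below are simplified stand-ins I made up for illustration (Link for Outlink, Data for ParseData), not the real Nutch types, and the prefix filter is just an example rule. Since the holder has no setter, the sketch builds a replacement holder around a filtered array; whether a plugin can hand such a replacement back to Nutch is exactly what I haven't been able to confirm.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: Link and Data are hypothetical stand-ins,
// not org.apache.nutch.parse.Outlink or ParseData.
public class OutlinkFilterSketch {

    // Stand-in for an outlink: a target URL plus its anchor text.
    static final class Link {
        final String toUrl;
        final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    // Stand-in for the parse data holder: the array is assigned once
    // in the constructor and there is no setter.
    static final class Data {
        private final Link[] links;

        Data(Link[] links) {
            this.links = links;
        }

        Link[] getLinks() {
            return links;
        }
    }

    // Example rule: keep only links whose URL starts with the given prefix.
    static Link[] keepPrefix(Link[] in, String prefix) {
        List<Link> kept = new ArrayList<Link>();
        for (Link link : in) {
            if (link.toUrl.startsWith(prefix)) {
                kept.add(link);
            }
        }
        return kept.toArray(new Link[kept.size()]);
    }

    // With no setter available, build a new holder around the filtered array.
    static Data withFilteredLinks(Data original, String prefix) {
        return new Data(keepPrefix(original.getLinks(), prefix));
    }

    public static void main(String[] args) {
        Link[] found = {
            new Link("http://a.example/x", "x"),
            new Link("http://b.example/y", "y")
        };
        Data filtered = withFilteredLinks(new Data(found), "http://a.example/");
        for (Link link : filtered.getLinks()) {
            System.out.println(link.toUrl);
        }
    }
}
```

Doing this after the fact means copying the whole holder, which is why hooking in at the point where the links are first collected looks cleaner to me.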

Is there a plugin type that would allow me to extend this without
altering the HtmlParser plugin, or at least DOMContentUtils?


I'm just getting acquainted with how Nutch is organized, so please be
patient if I ask an obvious question.  Thanks in advance,



Ricardo J. Méndez
http://ricardo.strangevistas.net/

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
