I am able to get the ParseText data structure, but I am having trouble with ParseData, as its constructor asks for a ParseStatus, outlinks, contentMeta and parseMeta. I can get the outlinks from OutlinkExtractor, but what about the other parameters? Also, getOutlinks asks for a Configuration, and I don't know where I can get it from.
On 6 Mar 2018 18:32, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:

> You should go over each segment, and for each one produce a ParseText and
> a ParseData. This is basically what the HTML Parser does for the whole
> document, which is why I suggested you should dive into its code.
> A ParseText is basically just a String containing the actual content of
> the segment (after stripping the HTML tags). This is usually the document
> you want to index.
> The ParseData structure is a little more complex, but the main things it
> contains are the title of this segment, and the outlinks from the segment
> (for further crawling). Take a look at the code of both classes and it
> should be relatively clear.
> Finally, you need to build one ParseResult object, with the original URL,
> and for each of the ParseText/ParseData pairs, call the put method, with
> the internal URL of the segment as the key.
>
> > -----Original Message-----
> > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > Sent: 06 March 2018 14:45
> > To: user@nutch.apache.org
> > Subject: RE: Regarding Internal Links
> >
> > > I am able to get the content corresponding to each Internal link by
> > > writing a parse filter plugin. Now I am not getting how to proceed
> > > further. How can I parse them as separate document and what should
> > > my ParseResult filter return??
> >
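
The steps described in the quoted reply (one text/data pair per segment, all collected in a single result keyed by the segment's internal URL) can be sketched roughly as below. This is only an illustration of the shape of the data, not Nutch code: the nested classes Text, Data, Result and Segment are hypothetical stand-ins defined here so the example runs on its own, and the status string "success" and the example.com URLs are assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in types only: they mimic the roles of ParseText, ParseData and
// ParseResult as described in this thread. They are NOT the real Nutch classes.
class SegmentParseSketch {

    /** Plain text of one segment, with the HTML tags already stripped. */
    static final class Text {
        final String text;
        Text(String text) { this.text = text; }
    }

    /** Per-segment data: status, title, outlinks and two metadata maps. */
    static final class Data {
        final String status;
        final String title;
        final List<String> outlinks;
        final Map<String, String> contentMeta;
        final Map<String, String> parseMeta;
        Data(String status, String title, List<String> outlinks,
             Map<String, String> contentMeta, Map<String, String> parseMeta) {
            this.status = status;
            this.title = title;
            this.outlinks = outlinks;
            this.contentMeta = contentMeta;
            this.parseMeta = parseMeta;
        }
    }

    /** One text/data pair stored in the result. */
    static final class Entry {
        final Text text;
        final Data data;
        Entry(Text text, Data data) { this.text = text; this.data = data; }
    }

    /** Result for the original URL, holding one entry per segment URL. */
    static final class Result {
        final String originalUrl;
        final Map<String, Entry> entries = new LinkedHashMap<>();
        Result(String originalUrl) { this.originalUrl = originalUrl; }
        void put(String key, Text text, Data data) {
            entries.put(key, new Entry(text, data));
        }
    }

    /** Hypothetical input: one internal segment already cut out of the page. */
    static final class Segment {
        final String url;
        final String title;
        final String text;
        final List<String> links;
        Segment(String url, String title, String text, List<String> links) {
            this.url = url;
            this.title = title;
            this.text = text;
            this.links = links;
        }
    }

    /** Builds one Result, adding a Text/Data pair for every segment. */
    static Result build(String originalUrl, Map<String, String> contentMeta,
                        List<Segment> segments) {
        Result result = new Result(originalUrl);
        for (Segment s : segments) {
            Text text = new Text(s.text);
            // The page's content metadata is reused for every segment; the
            // parse metadata starts out empty (both are assumptions here).
            Data data = new Data("success", s.title, s.links,
                                 contentMeta, new LinkedHashMap<>());
            result.put(s.url, text, data);  // key = internal URL of the segment
        }
        return result;
    }

    public static void main(String[] args) {
        List<Segment> segments = new ArrayList<>();
        segments.add(new Segment("http://example.com/page#intro", "Intro",
                "Intro text", Arrays.asList("http://example.com/other")));
        segments.add(new Segment("http://example.com/page#usage", "Usage",
                "Usage text", new ArrayList<>()));
        Result r = build("http://example.com/page", new LinkedHashMap<>(), segments);
        for (Map.Entry<String, Entry> e : r.entries.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().data.title);
        }
    }
}
```

In an actual parse filter, the corresponding pieces would, as far as I understand the Nutch API, be ParseStatus.STATUS_SUCCESS for the status, the metadata of the Content object passed to the filter for contentMeta, and a fresh Metadata object for parseMeta. Since plugin classes are Configurable, the Configuration needed by getOutlinks should be the one the framework hands to the plugin's setConf method. Please check these against the Javadoc of the Nutch version in use.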