Hi All,
We are using Nutch 1.13 and Solr 6. I am trying to use one of the extractors that come with Tika's boilerpipe support. For pages that consist mostly of outlinks, such as the one below, I get the best results with CanolaExtractor:

https://support.automationdirect.com/faq/dl205.php

However, when I check the indexed content in the Solr Admin tool, several of the outlinks are missing. I do not know why CanolaExtractor would leave out certain outlinks. If I do not use boilerpipe in Nutch, all the outlinks get indexed.

To disable the Tika extractor, I changed this property:

<property>
  <name>tika.extractor</name>
  <value>none</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>

Does anyone know why CanolaExtractor cannot extract all the outlinks? Also, which Tika extractor should be used for the example page above?

Any help would be great!

Thanks,
Madhvi