Tika boilerpipe extractors

Arora, Madhvi Wed, 27 Jun 2018 08:54:43 -0700

Hi All,


Note: I am reposting my question because the earlier one appears to have been
posted to the wrong thread.


We are using Nutch 1.13 and Solr 6. I am trying to use one of the extractors
that come with Tika's boilerpipe support. For pages that contain only
outlinks, I get the best results with CanolaExtractor, for example on a page
like this:

https://support.automationdirect.com/faq/dl205.php

However, when I check with the Solr Admin tool, several outlinks are missing
from the indexed content. I do not know why CanolaExtractor would leave out
certain outlinks.

If I do not use boilerpipe in Nutch, all the outlinks get indexed. To disable
the Tika extractor, I changed the property:

<property>
  <name>tika.extractor</name>
  <value>none</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>
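
For comparison, the boilerpipe setup that drops the outlinks would look
roughly like the sketch below in conf/nutch-site.xml. The second property
name, tika.extractor.boilerpipe.algorithm, and its list of values are taken
from memory of nutch-default.xml and are an assumption here; they should be
checked against the nutch-default.xml shipped with the Nutch version in use.

```xml
<!-- Sketch only: enable boilerpipe text extraction in parse-tika -->
<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
</property>

<!-- Assumed property name; verify it in your nutch-default.xml.
     Assumed values: DefaultExtractor, ArticleExtractor, CanolaExtractor -->
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>CanolaExtractor</value>
</property>
```

Switching the first value back to none gives the configuration shown above,
where all outlinks are indexed.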

Does anyone know why CanolaExtractor cannot extract all the outlinks? Also,
which Tika extractor should be used for the example page mentioned above?


Any help will be great!

Thanks,

Madhvi
