Delegate language identification to Tika
----------------------------------------

                 Key: NUTCH-1075
                 URL: https://issues.apache.org/jira/browse/NUTCH-1075
             Project: Nutch
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.4
            Reporter: Julien Nioche
            Assignee: Julien Nioche
             Fix For: 1.4


In 2.0, language identification is delegated to Tika and is done as part of 
the parsing step (rather than during indexing, as is done currently).
The attached patch is a backport from trunk which implements this and adds a 
new parameter that determines which strategy to use:

{code:xml} 
<property>
  <name>lang.extraction.policy</name>
  <value>detect,identify</value>
  <description>This determines when the plugin uses detection and
  statistical identification mechanisms. The order in which
  detect and identify are written determines the extraction
  policy. The default (detect,identify) means the plugin will
  first try to extract language info from page headers and metadata;
  if this is not successful, it will try using Tika language
  identification. Possible values are:
    detect
    identify
    detect,identify
    identify,detect
  </description>
</property>
{code} 
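
The ordering semantics described in the property can be sketched roughly as follows. This is only an illustration and not the code from the patch: the class LangPolicySketch, the method extractLanguage and the two Supplier arguments are hypothetical stand-ins for the header/metadata detection and the Tika-based identification.

```java
import java.util.Optional;
import java.util.function.Supplier;

public class LangPolicySketch {

    /**
     * Hypothetical helper (not part of the patch): runs the steps named in the
     * policy string in the order they are written and returns the first
     * language found, or an empty Optional if no step succeeds.
     */
    static Optional<String> extractLanguage(String policy,
                                            Supplier<Optional<String>> detect,
                                            Supplier<Optional<String>> identify) {
        for (String step : policy.split(",")) {
            Optional<String> lang;
            switch (step.trim()) {
                case "detect":   // stand-in for page headers and metadata
                    lang = detect.get();
                    break;
                case "identify": // stand-in for statistical identification
                    lang = identify.get();
                    break;
                default:
                    throw new IllegalArgumentException("Unknown policy step: " + step);
            }
            if (lang.isPresent()) {
                return lang; // first successful step wins
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed scenario: the page declares no language, identification finds French.
        Supplier<Optional<String>> detect = Optional::empty;
        Supplier<Optional<String>> identify = () -> Optional.of("fr");
        System.out.println(extractLanguage("detect,identify", detect, identify));
    }
}
```

With a policy of just "detect", the same page would yield no language at all, since the identification step is never consulted.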

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
