[ https://issues.apache.org/jira/browse/NUTCH-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916899#action_12916899 ]
Doğacan Güney commented on NUTCH-894:
-------------------------------------

+1 from me. If there are no objections in the next couple of days, I would like to commit this patch.

> Move statistical language identification from indexing to parsing step
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-894
>                 URL: https://issues.apache.org/jira/browse/NUTCH-894
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.0
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: 2.0
>
>         Attachments: NUTCH-894.patch
>
>
> The statistical identification of language is currently done as part of the
> indexing step, whereas the detection based on the HTTP header and HTML code is
> done during parsing.
> We could keep the same logic, i.e. do the statistical detection only if
> nothing has been found with the previous methods, but run it as part of parsing.
> This would be useful for ParseFilters which need the language information, or
> for use with ScoringFilters, e.g. to focus the crawl on a set of languages.
> Since the statistical models have been ported to Tika, we should probably rely
> on them instead of maintaining our own.
> Any thoughts on this?
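
For readers following the thread, the fallback order described in the issue (HTTP header first, then HTML markup, and the statistical model only if both come up empty) could be sketched roughly as below. This is a minimal illustration, not the contents of NUTCH-894.patch: the class `LanguageFallbackSketch`, the methods `firstNonBlank` and `detectLanguage`, and the stub detector are all hypothetical names, and the statistical step is passed in as a plain function rather than calling any real Nutch or Tika API.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of the detection order discussed in the issue.
public class LanguageFallbackSketch {

    /** Returns the first candidate that is neither null nor blank, trimmed and lower-cased. */
    static Optional<String> firstNonBlank(String... candidates) {
        for (String c : candidates) {
            if (c != null && !c.trim().isEmpty()) {
                return Optional.of(c.trim().toLowerCase(Locale.ROOT));
            }
        }
        return Optional.empty();
    }

    /**
     * Prefers the language declared in the HTTP header, then the one declared
     * in the HTML; only when both are missing is the (assumed) statistical
     * detector applied to the extracted text.
     */
    static String detectLanguage(String httpHeaderLang, String htmlLang, String text,
                                 Function<String, String> statisticalDetector) {
        return firstNonBlank(httpHeaderLang, htmlLang)
                .orElseGet(() -> statisticalDetector.apply(text));
    }

    public static void main(String[] args) {
        // Stub standing in for a real statistical model; always answers "fr".
        Function<String, String> stub = text -> "fr";

        System.out.println(detectLanguage("en", null, "Bonjour", stub)); // prints "en"
        System.out.println(detectLanguage(null, "  ", "Bonjour", stub)); // prints "fr"
    }
}
```

Because the detector is only invoked through `orElseGet`, the statistical model is never run when a declared language is available, which matches the "only if nothing has been found with the previous methods" condition in the description.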