[ https://issues.apache.org/jira/browse/NUTCH-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17467676#comment-17467676 ]
Fengtan commented on NUTCH-2278:
--------------------------------

[~lewismc] I am still around :) although I do not use Nutch anymore.

> Handle alpha-2 language codes consistently
> ------------------------------------------
>
>                 Key: NUTCH-2278
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2278
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.12
>            Reporter: Fengtan
>            Priority: Minor
>             Fix For: 1.19
>
>         Attachments: NUTCH-2278.patch, NUTCH-2278.patch
>
>
> The language-identifier plugin provides two extraction policies: detect and
> identify.
> However the two policies handle
> [alpha-2|https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2] codes differently:
> * 'identify' strips out the alpha-2 code e.g. if the identified language is
> 'en-US' then it will inject 'en' in the meta tags
> * 'detect' does not strip out the alpha-2 code e.g. if the detected language
> is 'en-US' then it will inject 'en-US' in the meta tags
> Any chance we can make this consistent and always strip out the alpha-2 code ?

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
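
For illustration, the normalization requested in the issue description (always reduce a tag such as 'en-US' to 'en', whichever policy produced it) could be sketched roughly as below. This is only a sketch under stated assumptions: the class LanguageCodes and the method stripRegion are hypothetical names, not part of the language-identifier plugin and not taken from the attached patches. Accepting '_' as a separator and lower-casing the result are also assumptions.

```java
import java.util.Locale;

// Hypothetical helper; not part of Nutch or of the attached NUTCH-2278 patches.
final class LanguageCodes {

    private LanguageCodes() {
    }

    /**
     * Returns the primary language subtag of a tag such as "en-US",
     * dropping the alpha-2 country code and anything after it.
     * Accepting '_' as a separator is an assumption, not documented behaviour.
     */
    static String stripRegion(String code) {
        if (code == null) {
            return null;
        }
        String trimmed = code.trim();
        // Find the first separator, if any.
        int cut = -1;
        for (int i = 0; i < trimmed.length(); i++) {
            char c = trimmed.charAt(i);
            if (c == '-' || c == '_') {
                cut = i;
                break;
            }
        }
        String primary = cut < 0 ? trimmed : trimmed.substring(0, cut);
        // Lower-casing is an assumption made so both policies yield the same form.
        return primary.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(stripRegion("en-US")); // prints "en"
        System.out.println(stripRegion("en"));    // prints "en"
    }
}
```

Applying one such helper to the output of both 'identify' and 'detect' would be one way to make the injected meta tag value the same for both policies.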