[ 
https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923919#action_12923919
 ] 

Markus Jelsma commented on NUTCH-923:
-------------------------------------

Andrzej is right. The LanguageIndexingFilter can return a value taken from the
HTTP header, which may be garbage. But shouldn't the filter itself make sure
that either `unknown` or a valid ISO-639-2 value is set?

This way client code can safely rely on the value of the lang field instead of
sanitizing it itself. If more components that use the lang field are added
later, must each of them sanitize it on its own?
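
To make the idea concrete, here is a minimal sketch of the kind of check the
filter could apply before setting the field. It is not the existing
LanguageIndexingFilter code: the class name `LangSanitizer` and the method
`sanitize` are hypothetical, and it assumes that the two-letter codes reported
by `java.util.Locale.getISOLanguages()` are the accepted set (as in the
`title_en` example from the issue). A three-letter ISO-639-2 check would need a
different code table.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: reduces a raw language value
// (for example one copied from an HTTP header) to a known code or "unknown".
final class LangSanitizer {

    static final String UNKNOWN = "unknown";

    // Assumption: the two-letter ISO 639 codes known to the JDK are the valid set.
    private static final Set<String> ISO_CODES =
        new HashSet<>(Arrays.asList(Locale.getISOLanguages()));

    static String sanitize(String raw) {
        if (raw == null) {
            return UNKNOWN;
        }
        String value = raw.trim().toLowerCase(Locale.ROOT);
        // Keep only the primary subtag, so "en-US" and "en_US" become "en".
        String primary = value.split("[-_]", 2)[0];
        return ISO_CODES.contains(primary) ? primary : UNKNOWN;
    }

    private LangSanitizer() {
    }
}
```

With a check like this in one place, every consumer of the lang field sees
either a known code or `unknown`, whatever the header contained.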

> Multilingual support for Solr-index-mapping
> -------------------------------------------
>
>                 Key: NUTCH-923
>                 URL: https://issues.apache.org/jira/browse/NUTCH-923
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.2
>            Reporter: Matthias Agethle
>            Assignee: Markus Jelsma
>            Priority: Minor
>
> It would be useful to extend the mapping possibilities when indexing to Solr.
> One useful feature would be to use the detected language of the HTML page
> (for example via the language-identifier plugin) and send the content to
> corresponding language-aware Solr fields.
> The mapping file could be as follows:
> <field dest="lang" source="lang"/>
> <field dest="title_${lang}" source="title" />
> so that the title field gets mapped to title_en for English pages and
> title_fr for French pages.
> What do you think? Could this be useful also to others?
> Or are there already other solutions out there?
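
The `${lang}` placeholder proposed in the description above could be resolved
roughly as follows. This is only a sketch under stated assumptions, not the way
the Nutch Solr mapping is implemented: `DestFieldResolver`, `resolve`, and the
fallback argument are hypothetical names, and the document is modelled as a
plain map of single-valued fields.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: expands ${field} placeholders in a
// destination field name using values from the document being indexed.
final class DestFieldResolver {

    // Matches ${name} and captures the name between the braces.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    static String resolve(String destTemplate, Map<String, String> docFields,
                          String fallback) {
        Matcher m = PLACEHOLDER.matcher(destTemplate);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Use the fallback when the document has no value for the field.
            String value = docFields.getOrDefault(m.group(1), fallback);
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    private DestFieldResolver() {
    }
}
```

Under these assumptions, `resolve("title_${lang}", fields, "unknown")` yields
`title_en` for a document whose lang field is `en`, and a destination without a
placeholder, such as `lang`, is returned unchanged.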

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
