Sebastian Nagel created NUTCH-3176:
--------------------------------------

             Summary: URLUtil and urlnormalizer-basic: add support for IDNA2008
                 Key: NUTCH-3176
                 URL: https://issues.apache.org/jira/browse/NUTCH-3176
             Project: Nutch
          Issue Type: New Feature
          Components: plugin, urlnormalizer, util
    Affects Versions: 1.22
            Reporter: Sebastian Nagel
             Fix For: 1.23


IDNA2008, defined in [RFC 5890|https://www.rfc-editor.org/rfc/rfc5890], 
superseded IDNA2003 ([RFC 3490|https://www.rfc-editor.org/rfc/rfc3490]). Despite 
the name, the IDNA2008 RFCs were published in 2010.

URLs and host names in IDNA2008 form now occur from time to time and cause 
issues when they cannot be processed. The Nutch components that handle them, 
URLUtil and urlnormalizer-basic, should support IDNA2008.

IDNA2008 allows Unicode characters from versions newer than Unicode 3.2. There 
are also some deviations in the mapping between Unicode and ASCII. For example, 
{{straße.de}} is mapped to {{strasse.de}} by IDNA2003 (an irreversible 
mapping), but to {{xn--strae-oqa.de}} by IDNA2008 (reversible).
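
The IDNA2003 side of this example can be reproduced with the JDK alone, because 
{{java.net.IDN}} is documented as implementing RFC 3490. The following is a 
minimal sketch, not the code used in URLUtil or urlnormalizer-basic; the class 
name {{Idna2003Demo}} and the helper {{toAscii}} are made up for illustration.

```java
import java.net.IDN;

public class Idna2003Demo {

    // Converts a host name to its ASCII form with the JDK's built-in IDN
    // class, which follows IDNA2003 (RFC 3490), not IDNA2008.
    public static String toAscii(String host) {
        return IDN.toASCII(host);
    }

    public static void main(String[] args) {
        // "stra\u00dfe.de" is straße.de; the escape avoids source-encoding issues.
        String host = "stra\u00dfe.de";
        // IDNA2003 folds U+00DF to "ss", so this prints strasse.de
        System.out.println(toAscii(host));
    }
}
```

Getting {{xn--strae-oqa.de}} instead requires an IDNA2008-capable 
implementation, which the JDK class does not provide. Which library would be 
used for that in Nutch is not stated in this issue.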



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
