Hans Brende created ANY23-335:
---------------------------------

             Summary: Improve document default language detection
                 Key: ANY23-335
                 URL: https://issues.apache.org/jira/browse/ANY23-335
             Project: Apache Any23
          Issue Type: Improvement
          Components: core
    Affects Versions: 2.2
            Reporter: Hans Brende
             Fix For: 2.3


Currently, to get a document's default language to pass to an 
ExtractionContext, only the "xml:lang" attribute of the root HTML element [is 
checked|https://github.com/apache/any23/blob/1867cc66de9a82cd98f1962fdabbd3a8680ff408/core/src/main/java/org/apache/any23/extractor/html/HTMLDocument.java#L338].

However, after reading [this W3C article on document language 
declaration|https://www.w3.org/International/questions/qa-html-language-declarations]
and [this W3C article on meta 
declarations|https://www.w3.org/International/questions/qa-http-and-lang], it 
appears that we should *also* be checking the "lang" attribute, and, as a 
fallback, the META http-equiv="Content-Language" elements.
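
A minimal sketch of the lookup order proposed above, written against the plain JDK DOM API rather than Any23's HTMLDocument. The class name DefaultLanguageSketch and its methods are hypothetical, and two details are assumptions rather than anything decided in this issue: "lang" is consulted before "xml:lang", and a Content-Language value that lists several languages is skipped instead of being split.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of Any23.
public class DefaultLanguageSketch {

    /** Returns the declared default language, or null if none is found. */
    public static String detectDefaultLanguage(Document doc) {
        Element root = doc.getDocumentElement();
        if (root != null) {
            // Assumption: "lang" takes precedence over "xml:lang".
            String lang = trimToNull(root.getAttribute("lang"));
            if (lang != null) {
                return lang;
            }
            lang = trimToNull(root.getAttribute("xml:lang"));
            if (lang != null) {
                return lang;
            }
        }
        // Fallback: <meta http-equiv="Content-Language" content="...">.
        // Note: the tag-name match is case-sensitive in an XML DOM.
        NodeList metas = doc.getElementsByTagName("meta");
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            if ("Content-Language".equalsIgnoreCase(meta.getAttribute("http-equiv"))) {
                String content = trimToNull(meta.getAttribute("content"));
                // Assumption: a comma-separated list names no single default, so skip it.
                if (content != null && content.indexOf(',') < 0) {
                    return content;
                }
            }
        }
        return null;
    }

    /** Trims a value and maps empty or missing values to null. */
    private static String trimToNull(String value) {
        if (value == null) {
            return null;
        }
        String trimmed = value.trim();
        return trimmed.isEmpty() ? null : trimmed;
    }

    /** Parses a well-formed XHTML string into a DOM document. */
    public static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<html><head><meta http-equiv=\"Content-Language\" content=\"fr\"/></head></html>");
        System.out.println(detectDefaultLanguage(doc));
    }
}
```

If something along these lines lived in one place, the resulting value could be handed to the ExtractionContext once, so that individual extractors would not need their own attribute lookups.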

Also: there seems to be some [overlap here with (at least) the 
HTMLMetaExtractor|https://github.com/apache/any23/blob/1867cc66de9a82cd98f1962fdabbd3a8680ff408/core/src/main/java/org/apache/any23/extractor/html/HTMLMetaExtractor.java#L117],
 which, conversely, appears to check the "lang" attribute, and *not* the 
"xml:lang" attribute. Could the HTMLMetaExtractor just retrieve the default 
document language from the ExtractionContext rather than [looking it up in the 
document all over 
again|https://github.com/apache/any23/blob/1867cc66de9a82cd98f1962fdabbd3a8680ff408/core/src/main/java/org/apache/any23/extractor/html/HTMLMetaExtractor.java#L68]?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
