Mariusz Cieślukowski created TIKA-3100:
------------------------------------------

             Summary: RFC822Parser ignores charset when extractAllAlternatives is set to true
                 Key: TIKA-3100
                 URL: https://issues.apache.org/jira/browse/TIKA-3100
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.24.1
         Environment: Windows 10 x64, OpenJDK 14
            Reporter: Mariusz Cieślukowski
         Attachments: testRFC822_quoted_charset_iso_8859_2

RFC822Parser seems to ignore the charset defined in the headers when it 
detects content with "extractAllAlternatives" set to true. When I set 
"extractAllAlternatives" to false, the content seems fine.

Test case:


{code:java}
    @Test
    public void testQuotedPrintableCharset() throws Exception {
        Metadata metadata = new Metadata();
        ContentHandler handler = new BodyContentHandler();
        ParseContext context = new ParseContext();

        RFC822Parser emailParser = new RFC822Parser();
        // The charset is ignored only when all alternatives are extracted.
        emailParser.setExtractAllAlternatives(true);

        // Close the stream when parsing is done; any exception fails the test.
        try (InputStream stream = getStream(
                "test-documents/testRFC822_quoted_charset_iso_8859_2")) {
            emailParser.parse(stream, handler, metadata, context);
        }

        // "Dzień dobry." must survive decoding from ISO-8859-2.
        String bodyText = handler.toString();
        assertTrue(bodyText.contains("Dzie\u0144 dobry."));
    }
{code}
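
As a rough illustration of why the assertion above fails when the declared charset is not applied, here is a minimal standalone sketch. It is not taken from the attachment or from Tika's code: the class name CharsetMismatchDemo and the choice of ISO-8859-1 as the wrongly applied charset are assumptions made for the example. It only shows that the byte for "ń" in ISO-8859-2 decodes to a different character under another single-byte charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tika.
public class CharsetMismatchDemo {

    // Decode raw body bytes with the given charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        Charset declared = Charset.forName("ISO-8859-2");

        // Assumed body bytes, as they might look after the
        // quoted-printable transfer encoding has been removed.
        byte[] raw = "Dzie\u0144 dobry.".getBytes(declared);

        // Decoding with the declared charset restores the text.
        String correct = decode(raw, declared);
        // Decoding with a different charset (ISO-8859-1 assumed here)
        // turns 0xF1 into "ñ" instead of "ń".
        String wrong = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println(correct.contains("Dzie\u0144 dobry.")); // true
        System.out.println(wrong.contains("Dzie\u0144 dobry."));   // false
    }
}
```

Which charset the parser actually falls back to in this case is not established here; the sketch only shows the general effect of a mismatch.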




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
