[ 
https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150752#comment-16150752
 ] 

Tim Allison commented on TIKA-2219:
-----------------------------------

I defer to others, but I think RFC 822 specifies plain US-ASCII as the default 
and doesn't expect or allow anything else unless it is declared, as in the 
example above.

We can be more flexible than the spec, but the challenge is that the 
underlying mime4j parser reads an {{InputStream}} and interprets it as ASCII.  
I don't think we can pass it a {{Reader}} or otherwise specify an encoding, 
though I only looked very quickly.
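
To illustrate why the ASCII assumption matters, here is a minimal sketch using only the JDK. It does not call mime4j; the class name {{AsciiDecodingSketch}} and the helper {{decode}} are made up for illustration. It decodes the same raw bytes once as US-ASCII and once as ISO-8859-1 to show that a non-ASCII byte is lost in the first case.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is not mime4j or Tika code.
final class AsciiDecodingSketch {

    // Decode raw bytes with an explicit charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "Caf" followed by the single byte 0xE9, which is 'é' in
        // ISO-8859-1 and windows-1252 but is not valid US-ASCII.
        byte[] raw = {'C', 'a', 'f', (byte) 0xE9};

        String asAscii = decode(raw, StandardCharsets.US_ASCII);
        String asLatin1 = decode(raw, StandardCharsets.ISO_8859_1);

        // The ASCII decode replaces 0xE9 with U+FFFD; the Latin-1 decode keeps 'é'.
        System.out.println(asAscii.charAt(3) == '\uFFFD');
        System.out.println(asLatin1.charAt(3) == '\u00E9');
    }
}
```

If the parser only ever sees the stream as ASCII, the original byte value is already gone by the time a string reaches the caller, so any fix would probably have to happen before or inside the byte-level parsing step.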

Any recommendations?

> CharsetDetector no longer detects windows-1252 charset
> ------------------------------------------------------
>
>                 Key: TIKA-2219
>                 URL: https://issues.apache.org/jira/browse/TIKA-2219
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.14
>         Environment: Any.
>            Reporter: Pascal Essiembre
>            Priority: Minor
>             Fix For: 2.0, 1.15
>
>         Attachments: test.txt
>
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is 
> always detected instead.  While not tested, this likely affects other 
> windows-125* encodings as well.
> I tracked it down to a change in the 
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method.  Now it always 
> returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? 
> "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}} 
> method so that the returned {{CharsetMatch}} has the proper name.  The 
> problem is that the {{CharsetDetector#detectAll()}} method overwrites the 
> correct match with a new one that returns the value of {{#getName()}} from 
> the {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in the 
> {{detectAll()}} method are replaced with new ones, but changing this code in 
> that method appears to work for me:
> // Remove this:
> //                    CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //                    matches.add(m);
> // Add this instead:
>                     matches.add(charsetMatch);
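
The {{haveC1Bytes}} condition quoted in the description can be sketched in isolation as follows. This is a simplified stand-alone illustration, not the Tika or ICU implementation: the class {{C1ByteSketch}} and its methods are hypothetical, and it assumes that "C1 bytes" means bytes in the range 0x80-0x9F, which are control codes in ISO-8859-1 but printable characters (curly quotes, the euro sign, and so on) in windows-1252.

```java
// Hypothetical illustration only; not the actual CharsetRecog_sbcs code.
final class C1ByteSketch {

    // Returns true if any byte falls in the C1 range 0x80-0x9F.
    static boolean hasC1Bytes(byte[] input) {
        for (byte b : input) {
            int value = b & 0xFF; // treat the byte as unsigned
            if (value >= 0x80 && value <= 0x9F) {
                return true;
            }
        }
        return false;
    }

    // Mirrors the ternary quoted in the issue description.
    static String guessName(byte[] input) {
        return hasC1Bytes(input) ? "windows-1252" : "ISO-8859-1";
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are curly double quotes in windows-1252.
        byte[] withC1 = {(byte) 0x93, 'h', 'i', (byte) 0x94};
        // 0xE9 is 'é' in both charsets, so it does not disambiguate.
        byte[] withoutC1 = {'c', 'a', 'f', (byte) 0xE9};

        System.out.println(guessName(withC1));    // prints "windows-1252"
        System.out.println(guessName(withoutC1)); // prints "ISO-8859-1"
    }
}
```

The point of the report is that a name chosen this way has to survive into the match objects returned to the caller; if {{detectAll()}} rebuilds them from the recognizer's fixed {{getName()}}, the distinction is lost.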



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
