[
https://issues.apache.org/jira/browse/RAT-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17838500#comment-17838500
]
Richard Eckart de Castilho commented on RAT-147:
------------------------------------------------
[~claude] ha, I found something:
https://lists.apache.org/thread/bwdbppbnpw6zdqqktwtmflpry53hbsr8
So it looks like the file was {{unix-newlines.txt.bin}} which indeed has a BOM.
I have attached it here and another similar file.
[^unix-newlines.txt.bin] [^windows-newlines.txt.bin]
> binary guesser design improvement
> ---------------------------------
>
> Key: RAT-147
> URL: https://issues.apache.org/jira/browse/RAT-147
> Project: Apache Rat
> Issue Type: Improvement
> Affects Versions: 0.8
> Reporter: Marshall Schor
> Priority: Minor
> Attachments: unix-newlines.txt.bin, windows-newlines.txt.bin
>
>
> A release manager cut a release; RAT was run, all was OK. Another user tried
> building from source / tag, and RAT complained of 2 files missing headers.
> This was traced to the "binary guesser", which reads the first 200 bytes of a
> file and "guesses" whether it is binary. The file in question had a UTF-8
> byte-order mark at the beginning and was otherwise plain ASCII.
> The reason for 2 different results: the release manager's OS had a default
> file encoding set to US-ASCII (as determined by running a small Java program
> that prints out the value of System.getProperty("file.encoding")). This encoding
> is for 7-bit ASCII, so the guesser, when decoding the file, gets a malformed-input
> exception on the 3 byte-order-mark bytes at the beginning of the file. This causes the
> guesser to conclude this is a "binary" file which doesn't need to be
> RAT-checked. The other user was on a Windows 7 machine, which has the
> file.encoding defaulting to Cp1252 - which does have code points defined for
> the first 3 bytes, and therefore doesn't throw any exception. This makes the
> guesser guess that this isn't a binary file, and it checks the file and
> reports a missing header (the file is test data...).
> Workaround - add the file to the explicit excludes.
> Potential problem - on a machine with default encoding US-ASCII, RAT will
> improperly skip checking files which perhaps should have headers, if they
> have a UTF-8 byte-order mark.
> Potential problem #2 - RAT is dependent on the default file encoding setting
> for part of its behavior, causing differences in what it checks.
> I'm not sure what a good solution would be here. It might range from
> eliminating the binary "guesser" that looks at the first 200 bytes of a file,
> to forcing UTF-8 as the charset to use.
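The charset dependence described above can be reproduced outside RAT. The following is a minimal sketch, not RAT's actual guesser: the class name {{BomGuessDemo}} and its helper methods are hypothetical, and it assumes the guesser's decision amounts to "do the leading bytes decode without error under the chosen charset". It decodes a BOM-prefixed ASCII sample strictly under US-ASCII, UTF-8 and Cp1252 (falling back to ISO-8859-1 if the JVM does not ship windows-1252), and shows a BOM check that does not depend on {{file.encoding}}.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Apache RAT.
class BomGuessDemo {

    /** Returns true if the bytes decode without error under a strict decoder for the charset. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns true if the data starts with the UTF-8 byte-order mark EF BB BF. */
    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
            && (data[0] & 0xFF) == 0xEF
            && (data[1] & 0xFF) == 0xBB
            && (data[2] & 0xFF) == 0xBF;
    }

    /** Builds a sample like the reported file: a UTF-8 BOM followed by plain ASCII. */
    static byte[] sample() {
        byte[] text = "plain ASCII test data\n".getBytes(StandardCharsets.US_ASCII);
        byte[] data = new byte[text.length + 3];
        data[0] = (byte) 0xEF;
        data[1] = (byte) 0xBB;
        data[2] = (byte) 0xBF;
        System.arraycopy(text, 0, data, 3, text.length);
        return data;
    }

    /** Cp1252 if the JVM provides it, otherwise ISO-8859-1 as a stand-in single-byte charset. */
    static Charset cp1252OrLatin1() {
        return Charset.isSupported("windows-1252")
            ? Charset.forName("windows-1252")
            : StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        byte[] data = sample();
        System.out.println("starts with BOM: " + startsWithUtf8Bom(data));
        System.out.println("US-ASCII clean:  " + decodesCleanly(data, StandardCharsets.US_ASCII));
        System.out.println("UTF-8 clean:     " + decodesCleanly(data, StandardCharsets.UTF_8));
        System.out.println("Cp1252 clean:    " + decodesCleanly(data, cp1252OrLatin1()));
    }
}
```

Under this sketch the same bytes fail strict decoding as US-ASCII but succeed as UTF-8 and Cp1252, which matches the two outcomes in the report. Decoding with a fixed charset, or checking for the BOM before guessing, would make the result independent of the platform default; whether either fits RAT's design is left open here.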