[ https://issues.apache.org/jira/browse/RAT-147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Claude Warren resolved RAT-147.
-------------------------------
    Resolution: Fixed

Resolved with pull request #240 https://github.com/apache/creadur-rat/pull/240

> binary guesser design improvement
> ---------------------------------
>
>                 Key: RAT-147
>                 URL: https://issues.apache.org/jira/browse/RAT-147
>             Project: Apache Rat
>          Issue Type: Improvement
>    Affects Versions: 0.8
>            Reporter: Marshall Schor
>            Assignee: Claude Warren
>            Priority: Minor
>             Fix For: 0.17
>
>         Attachments: unix-newlines.txt.bin, windows-newlines.txt.bin
>
>
> A release manager cut a release; RAT was run, all was OK. Another user tried
> building from source / tag, and RAT complained of 2 files missing headers.
>
> This was traced to the "binary guesser", which read the 1st 200 bytes of a
> file and "guessed" if it was binary. The file in question had a UTF-8
> byte-order mark at the beginning and was, in fact, plain ASCII after that.
>
> The reason for 2 different results: the release manager's OS had a default
> file encoding set to US-ASCII (as determined by running a small Java program
> that prints out the value of System.getProperty("file.encoding")). This
> encoding is for 7-bit ASCII, so the guesser, when decoding this, gets a
> malformed-input exception on the 3 bytes at the beginning of the file. This
> causes the guesser to conclude this is a "binary" file which doesn't need to
> be RAT-checked. The other user was on a Windows 7 machine, which has the
> file.encoding defaulting to Cp1252 - which does have code points defined for
> the first 3 bytes, and therefore doesn't throw any exception. This makes the
> guesser guess that this isn't a binary file, and it checks the file and
> reports a missing header (the file is test data...).
>
> Workaround - add the file to the explicit excludes.
>
> Potential problem - on a machine with default encoding US-ASCII, RAT will
> improperly skip checking files which perhaps should have headers, if they
> have a UTF-8 byte-order mark.
> Potential problem #2 - RAT is dependent on the default file encoding setting
> for part of its behavior, causing differences in what it checks.
>
> I'm not sure what a good solution would be here. It might range from
> eliminating the binary "guesser" that looks at the first 200 bytes of a file,
> to forcing UTF-8 as the charset to use.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
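
The encoding dependence described in the report can be reproduced with a small sketch. This is not RAT's actual guesser code, and it is not the change made in pull request #240; the class name BomDecodeDemo and the helper decodesCleanly are hypothetical names chosen for illustration. It strictly decodes the same bytes (a UTF-8 byte-order mark followed by plain ASCII) under three charsets, assuming the JRE provides windows-1252 (Java's canonical name for Cp1252).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Apache RAT.
final class BomDecodeDemo {

    // Returns true if the bytes decode without error in the given charset.
    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // UTF-8 byte-order mark (EF BB BF) followed by plain ASCII text.
        byte[] sample = {
            (byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
            'h', 'e', 'l', 'l', 'o', '\n'
        };

        // The platform default that the report says influenced the guesser.
        System.out.println("default charset: " + Charset.defaultCharset());

        Charset[] candidates = {
            StandardCharsets.US_ASCII,        // 7-bit only: BOM bytes are malformed
            Charset.forName("windows-1252"),  // assumed available; maps all 3 BOM bytes
            StandardCharsets.UTF_8            // decodes the BOM as U+FEFF
        };
        for (Charset cs : candidates) {
            System.out.println(cs.name() + " decodes cleanly: "
                    + decodesCleanly(sample, cs));
        }
    }
}
```

With these inputs the US-ASCII decoder rejects the three leading bytes while windows-1252 and UTF-8 accept them, which matches the two different results in the report. Passing a fixed charset such as UTF-8, rather than relying on the platform default, is one way to make such a check behave the same on every machine.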