[jira] [Created] (RAT-147) binary guesser design improvement

Marshall Schor (JIRA) Tue, 27 Aug 2013 14:00:30 -0700

Marshall Schor created RAT-147:
----------------------------------

             Summary: binary guesser design improvement
                 Key: RAT-147
                 URL: https://issues.apache.org/jira/browse/RAT-147
             Project: Apache Rat
          Issue Type: Improvement
    Affects Versions: 0.8
            Reporter: Marshall Schor
            Priority: Minor



A release manager cut a release; RAT was run, and all was OK.  Another user 
tried building from source / tag, and RAT complained of 2 files missing 
headers.  This was traced to the "binary guesser", which reads the first 200 
bytes of a file and "guesses" whether it is binary.  The file in question had 
a UTF-8 byte-order mark at the beginning and was otherwise plain ASCII.  The 
reason for the 2 different results: the release manager's OS had its default 
file encoding set to US-ASCII (as determined by running a small Java program 
that prints out the value of System.getProperty("file.encoding")).  This 
encoding is 7-bit ASCII, so the guesser, when decoding the file, gets a 
malformed-input exception on the 3 bytes at the beginning of the file.  This 
causes the guesser to conclude that this is a "binary" file which doesn't need 
to be RAT-checked.  The other user was on a Windows 7 machine, where 
file.encoding defaults to Cp1252, which does have code points defined for the 
first 3 bytes and therefore doesn't throw any exception.  This makes the 
guesser guess that this isn't a binary file, so it checks the file and reports 
a missing header (the file is test data...).
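
The difference between the two machines can be reproduced in isolation. The 
following is a minimal sketch, not RAT's actual guesser code: the class name 
BomDecodeDemo and the helper decodesCleanly are hypothetical, and the sketch 
assumes the guesser decodes strictly (reporting malformed input rather than 
replacing it), as the exception described above suggests.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class BomDecodeDemo {

    // Hypothetical helper: true if the bytes decode under the given charset
    // without a malformed-input or unmappable-character error.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // UTF-8 byte-order mark (EF BB BF) followed by plain ASCII text.
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'e', 'l', 'l', 'o'};

        System.out.println("default file.encoding = " + System.getProperty("file.encoding"));
        System.out.println("US-ASCII decodes cleanly: "
                + decodesCleanly(sample, StandardCharsets.US_ASCII));
        System.out.println("Cp1252 decodes cleanly:   "
                + decodesCleanly(sample, Charset.forName("windows-1252")));
    }
}
```

Under these assumptions, the US-ASCII decode fails on the first byte of the 
byte-order mark, while the Cp1252 decode succeeds because all 3 bytes map to 
defined characters in that code page.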

Workaround - add the file to the explicit excludes.

Potential problem - on a machine with default encoding US-ASCII, RAT will 
improperly skip checking files which perhaps should have headers, if they have 
a UTF-8 byte-order mark.

Potential problem #2 - RAT is dependent on the default file encoding setting 
for part of its behavior, causing differences in what it checks.

I'm not sure what a good solution would be here.  It might range from 
eliminating the binary "guesser" that looks at the first 200 bytes of a file, 
to forcing UTF-8 as the charset to use.
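
As one illustration of the "force UTF-8" end of that range, here is a minimal 
sketch of a guess that does not depend on the platform default encoding.  It 
is not a proposed patch: the class Utf8TextGuess and the method 
looksLikeUtf8Text are hypothetical names, and the rule that control 
characters indicate binary content is an assumption added for the example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8TextGuess {

    // Hypothetical helper: guesses whether the first bytes of a file are text,
    // always decoding as UTF-8 regardless of file.encoding.
    static boolean looksLikeUtf8Text(byte[] head) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer out = CharBuffer.allocate(head.length + 1);

        // endOfInput=false: a multi-byte sequence cut off at the end of the
        // sample is left undecoded instead of being reported as malformed.
        CoderResult result = decoder.decode(ByteBuffer.wrap(head), out, false);
        if (result.isError()) {
            return false; // not valid UTF-8, so treat as binary
        }

        out.flip();
        for (int i = 0; i < out.length(); i++) {
            char c = out.charAt(i);
            if (i == 0 && c == '\uFEFF') {
                continue; // a leading byte-order mark is allowed
            }
            // Assumption for this sketch: control characters other than
            // common whitespace indicate binary content.
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r' && c != '\f') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] bomText = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'e', 'l', 'l', 'o'};
        System.out.println("BOM + ASCII looks like text: " + looksLikeUtf8Text(bomText));
    }
}
```

With this approach a file starting with a UTF-8 byte-order mark would be 
treated as text on every platform.  The trade-off is that text files in 
other encodings (Cp1252 with accented characters, for example) would fail 
the UTF-8 decode and be guessed as binary.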



