[jira] [Commented] (RAT-147) binary guesser design improvement
[ https://issues.apache.org/jira/browse/RAT-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838720#comment-17838720 ]

Claude Warren commented on RAT-147:
-----------------------------------

The Tika parser correctly identifies these as Text files, and correctly locates the lines within. Tests added to.

> binary guesser design improvement
> ---------------------------------
>
>                 Key: RAT-147
>                 URL: https://issues.apache.org/jira/browse/RAT-147
>             Project: Apache Rat
>          Issue Type: Improvement
>    Affects Versions: 0.8
>            Reporter: Marshall Schor
>            Assignee: Claude Warren
>            Priority: Minor
>         Attachments: unix-newlines.txt.bin, windows-newlines.txt.bin
>
>
> A release manager cut a release; RAT was run, all was OK. Another user tried
> building from source / tag, and RAT complained of 2 files missing headers.
> This was traced to the "binary guesser" which read the 1st 200 bytes of a
> file and "guessed" if it was binary. The file in question had a UTF-8
> byte-order mark at the beginning, and was, in fact after that, plain ASCII.
> The reason for 2 different results: the release manager's OS had a default
> file encoding set to US-ASCII (as determined by running a small Java program
> that prints out the value of System.getProperty("file.encoding")). This
> encoding is for 7-bit ASCII, so the guesser when decoding this gets a
> malformed exception on the 3 bytes at the beginning of the file. This causes
> the guesser to conclude this is a "binary" file which doesn't need to be
> RAT-checked. The other user was on a Windows 7 machine, which has the
> file.encoding defaulting to Cp1252 - which does have code points defined for
> the first 3 bytes, and therefore doesn't throw any exception. This makes the
> guesser guess that this isn't a binary file, and it checks the file and
> reports a missing header (the file is test data...).
> Workaround - add the file to the explicit excludes.
> Potential problem - on a machine with default encoding US-ASCII, RAT will
> improperly skip checking files which perhaps should have headers, if they
> have a UTF-8 byte-order mark.
> Potential problem #2 - RAT is dependent on the default file encoding setting
> for part of its behavior, causing differences in what it checks.
> I'm not sure what a good solution would be here. It might range from
> eliminating the binary "guesser" that looks at the first 200 bytes of a file,
> to forcing UTF-8 as the charset to use.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
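The encoding dependence described in the issue can be reproduced outside RAT. The following is a minimal sketch, not RAT's actual guesser: the class name `BomDecodeDemo`, its helper methods, and the sample text are hypothetical, and the strict-decoder check is only an assumption about how a guesser of this kind might react to undecodable bytes. It decodes a UTF-8 byte-order mark followed by plain ASCII under the charsets mentioned in the description.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class BomDecodeDemo {

    /** Returns true if the bytes decode without error under a strict decoder for the charset. */
    public static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Malformed or unmappable input: a naive guesser might call this "binary".
            return false;
        }
    }

    /** Hypothetical fixture: UTF-8 BOM (EF BB BF) followed by plain ASCII text. */
    public static byte[] sampleBytes() {
        byte[] ascii = "plain ASCII test data\n".getBytes(StandardCharsets.US_ASCII);
        byte[] bytes = new byte[ascii.length + 3];
        bytes[0] = (byte) 0xEF;
        bytes[1] = (byte) 0xBB;
        bytes[2] = (byte) 0xBF;
        System.arraycopy(ascii, 0, bytes, 3, ascii.length);
        return bytes;
    }

    public static void main(String[] args) {
        byte[] bytes = sampleBytes();
        // The platform default is what made the two machines in the report disagree.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("US-ASCII:     " + decodesCleanly(bytes, StandardCharsets.US_ASCII));
        System.out.println("windows-1252: " + decodesCleanly(bytes, Charset.forName("windows-1252")));
        System.out.println("UTF-8:        " + decodesCleanly(bytes, StandardCharsets.UTF_8));
    }
}
```

The three BOM bytes are outside the 7-bit range, so the strict US-ASCII decoder rejects them, while windows-1252 maps each of them to a character and UTF-8 decodes them as U+FEFF. Any check that decodes with the platform default therefore gives different answers on different machines.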
[jira] [Commented] (RAT-147) binary guesser design improvement
[ https://issues.apache.org/jira/browse/RAT-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838500#comment-17838500 ]

Richard Eckart de Castilho commented on RAT-147:
------------------------------------------------

[~claude] ha, I found something: https://lists.apache.org/thread/bwdbppbnpw6zdqqktwtmflpry53hbsr8

So it looks like the file was {{unix-newlines.txt.bin}} which indeed has a BOM. I have attached it here and another similar file.

[^unix-newlines.txt.bin]
[^windows-newlines.txt.bin]
[jira] [Commented] (RAT-147) binary guesser design improvement
[ https://issues.apache.org/jira/browse/RAT-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838498#comment-17838498 ]

Richard Eckart de Castilho commented on RAT-147:
------------------------------------------------

[~claude] I assume [~schor] won't respond, so I'll chime in here. The issue probably came up during an Apache UIMA release a few years ago. From reading the description, it looks like the build was ok on [~schor]'s machine, but failed on another one (possibly mine or maybe that was even before my time).

I did a bit of digging in the history of the UIMA Java SDK and uimaFIT repos. In particular the latter had a release around the time this issue was filed. However, I didn't find a file for which an exclude was added at the time that would match the characteristics described in the issue... sorry.
[jira] [Commented] (RAT-147) binary guesser design improvement
[ https://issues.apache.org/jira/browse/RAT-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838483#comment-17838483 ]

Claude Warren commented on RAT-147:
-----------------------------------

[~schor] Do you have examples of these types of files? I know this was a decade ago but I am hoping we can get a test case built.
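One of the options raised in the issue description, forcing UTF-8 instead of relying on the platform default, can be sketched as follows. This is an illustration under stated assumptions, not RAT's code and not the Tika-based approach: the class `Utf8TextGuesser`, its methods, and the control-character heuristic are all hypothetical. Only the 200-byte sample size comes from the description.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8TextGuesser {

    /** Sample size taken from the issue description ("the first 200 bytes"). */
    static final int SAMPLE_SIZE = 200;

    /** Returns true if the content starts with the UTF-8 byte-order mark EF BB BF. */
    public static boolean hasUtf8Bom(byte[] b) {
        return b.length >= 3
            && (b[0] & 0xFF) == 0xEF
            && (b[1] & 0xFF) == 0xBB
            && (b[2] & 0xFF) == 0xBF;
    }

    /** Guesses "text" without consulting the platform default charset. */
    public static boolean looksLikeText(byte[] content) {
        int offset = hasUtf8Bom(content) ? 3 : 0;   // skip the BOM if present
        int length = Math.min(SAMPLE_SIZE, content.length - offset);
        ByteBuffer in = ByteBuffer.wrap(content, offset, length);
        CharBuffer out = CharBuffer.allocate(length); // UTF-8 never yields more chars than bytes
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        // endOfInput=false: a multi-byte sequence cut off at the sample boundary is not an error.
        CoderResult result = decoder.decode(in, out, false);
        if (result.isError()) {
            return false;
        }
        out.flip();
        while (out.hasRemaining()) {
            char c = out.get();
            // Assumed heuristic: control characters other than common whitespace suggest binary.
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r' && c != '\f') {
                return false;
            }
        }
        return true;
    }
}
```

A test case along the lines requested in the thread could feed this kind of check a BOM followed by ASCII with Unix and Windows line endings, mirroring the two attached files, and assert that the result does not change with the `file.encoding` setting.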