[jira] [Commented] (RAT-147) binary guesser design improvement

2024-04-18 Thread Claude Warren (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838720#comment-17838720
 ] 

Claude Warren commented on RAT-147:
-----------------------------------

The Tika parser correctly identifies these as Text files, and correctly locates 
the lines within.

Tests added too.

> binary guesser design improvement
> ---------------------------------
>
> Key: RAT-147
> URL: https://issues.apache.org/jira/browse/RAT-147
> Project: Apache Rat
> Issue Type: Improvement
> Affects Versions: 0.8
> Reporter: Marshall Schor
> Assignee: Claude Warren
> Priority: Minor
> Attachments: unix-newlines.txt.bin, windows-newlines.txt.bin
>
> A release manager cut a release; RAT was run, and all was OK. Another user
> tried building from source / tag, and RAT complained of 2 files missing
> headers. This was traced to the "binary guesser", which read the first 200
> bytes of a file and "guessed" whether it was binary. The file in question
> had a UTF-8 byte-order mark at the beginning and was otherwise plain ASCII.
>
> The reason for the 2 different results: the release manager's OS had a
> default file encoding of US-ASCII (as determined by running a small Java
> program that prints the value of System.getProperty("file.encoding")). This
> encoding is 7-bit ASCII, so when the guesser decodes the file it gets a
> malformed-input exception on the 3 bytes at the beginning of the file. This
> causes the guesser to conclude that this is a "binary" file which doesn't
> need to be RAT-checked. The other user was on a Windows 7 machine, where
> file.encoding defaults to Cp1252, which does have code points defined for
> the first 3 bytes and therefore doesn't throw any exception. This makes the
> guesser guess that this isn't a binary file, so it checks the file and
> reports a missing header (the file is test data...).
>
> Workaround - add the file to the explicit excludes.
>
> Potential problem - on a machine with default encoding US-ASCII, RAT will
> improperly skip checking files which perhaps should have headers, if they
> have a UTF-8 byte-order mark.
>
> Potential problem #2 - RAT is dependent on the default file encoding setting
> for part of its behavior, causing differences in what it checks.
>
> I'm not sure what a good solution would be here. It might range from
> eliminating the binary "guesser" that looks at the first 200 bytes of a file,
> to forcing UTF-8 as the charset to use.
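
For anyone who wants to reproduce the behaviour described in the issue, here is a minimal sketch. It is not RAT's actual guesser code: the class name BomDecodeDemo, the helpers decodesCleanly and looksLikeText, and the sample bytes are all made up for illustration. It decodes a UTF-8 BOM followed by plain ASCII with a strict decoder under different charsets, and then shows the "force UTF-8" option mentioned at the end of the description.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class BomDecodeDemo {
    // Hypothetical sample: UTF-8 byte-order mark (EF BB BF) followed by plain ASCII.
    static final byte[] SAMPLE = {
        (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'p', 'l', 'a', 'i', 'n', '\n'
    };

    // Returns true if every byte decodes under the given charset.
    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Assumption: one way to remove the platform dependency is to always
    // decode as UTF-8, regardless of the file.encoding system property.
    static boolean looksLikeText(byte[] data) {
        return decodesCleanly(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("US-ASCII clean:  "
                + decodesCleanly(SAMPLE, StandardCharsets.US_ASCII));
        System.out.println("Cp1252 clean:    "
                + decodesCleanly(SAMPLE, Charset.forName("windows-1252")));
        System.out.println("UTF-8 clean:     " + looksLikeText(SAMPLE));
    }
}
```

With a strict decoder the BOM bytes are rejected under US-ASCII but accepted under Cp1252, which matches the two outcomes in the report. Decoding with a fixed charset gives the same answer on every machine, though whether that is the right fix for RAT is a separate question.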



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (RAT-147) binary guesser design improvement

2024-04-18 Thread Richard Eckart de Castilho (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838500#comment-17838500
 ] 

Richard Eckart de Castilho commented on RAT-147:


[~claude] ha, I found something:

https://lists.apache.org/thread/bwdbppbnpw6zdqqktwtmflpry53hbsr8

So it looks like the file was {{unix-newlines.txt.bin}}, which indeed has a BOM. 
I have attached it here, along with another similar file.

 [^unix-newlines.txt.bin]  [^windows-newlines.txt.bin] 





[jira] [Commented] (RAT-147) binary guesser design improvement

2024-04-18 Thread Richard Eckart de Castilho (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838498#comment-17838498
 ] 

Richard Eckart de Castilho commented on RAT-147:


[~claude] I assume [~schor] won't respond, so I'll chime in here. The issue 
probably came up during an Apache UIMA release a few years ago. From reading 
the description, it looks like the build was ok on [~schor]'s machine, but 
failed on another one (possibly mine or maybe that was even before my time). I 
did a bit of digging in the history of the UIMA Java SDK and uimaFIT repos. In 
particular the latter had a release around the time this issue was filed. 
However, I didn't find a file for which an exclude was added at the time that 
would match the characteristics described in the issue... sorry.



[jira] [Commented] (RAT-147) binary guesser design improvement

2024-04-18 Thread Claude Warren (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838483#comment-17838483
 ] 

Claude Warren commented on RAT-147:
-----------------------------------

[~schor] Do you have examples of these types of files?  I know this was a 
decade ago but I am hoping we can get a test case built.
