[RAT] [DISCUSS] Add inspection of archives.

2024-04-18 Thread Claude Warren
Currently Rat does no inspection of archives.  This means that a jar that
does not meet the licensing of a project could be included and would not be
detected.

Currently the DefaultAnalyser simply marks the archives as archives and
does nothing more with them.  Under the proposed Tika change this has not
changed, but we do have better identification of archives.

I would like to see the DefaultAnalyser open the archives and process the
contents via what is essentially another default analyser instance.  The
idea is that the result of scanning the contents of the archive will be
reported as the scan of the jar itself.  So if it contains 3 licenses, the
report for the archive itself will state that it has those 3 licenses.
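
As a rough illustration of the idea above (this is not Rat's actual API),
the sketch below opens a zip/jar with java.util.zip, runs a toy license check
on each entry, recurses into nested archives, and returns the union of the
licenses as the result for the archive itself.  The class name
ArchiveScanSketch, the marker-text matcher detectLicenses and the ids AL/GPL
are assumptions made purely for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch; none of these names exist in Rat.
final class ArchiveScanSketch {

    // Toy stand-in for a real license matcher: it only looks for marker text.
    static Set<String> detectLicenses(byte[] content) {
        String text = new String(content, StandardCharsets.ISO_8859_1);
        Set<String> found = new TreeSet<>();
        if (text.contains("Apache License")) {
            found.add("AL");
        }
        if (text.contains("GNU General Public License")) {
            found.add("GPL");
        }
        return found;
    }

    // Assumption: archives are recognised by file extension only.
    static boolean looksLikeArchive(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return lower.endsWith(".jar") || lower.endsWith(".zip");
    }

    // Scans every entry of the archive; nested archives are scanned recursively.
    // The returned set is what would be reported for the archive itself.
    static Set<String> scanArchive(byte[] archiveBytes) {
        Set<String> licenses = new TreeSet<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archiveBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                byte[] buffer = new byte[4096];
                int read;
                while ((read = zip.read(buffer)) != -1) {
                    content.write(buffer, 0, read);
                }
                byte[] bytes = content.toByteArray();
                licenses.addAll(looksLikeArchive(entry.getName())
                        ? scanArchive(bytes)
                        : detectLicenses(bytes));
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return licenses;
    }

    // Builds a small zip in memory so the example needs no files on disk.
    static byte[] zip(Map<String, byte[]> entries) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Map.Entry<String, byte[]> item : entries.entrySet()) {
                zip.putNextEntry(new ZipEntry(item.getKey()));
                zip.write(item.getValue());
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, byte[]> inner = new LinkedHashMap<>();
        inner.put("COPYING", "GNU General Public License".getBytes(StandardCharsets.UTF_8));

        Map<String, byte[]> outer = new LinkedHashMap<>();
        outer.put("LICENSE", "Apache License, Version 2.0".getBytes(StandardCharsets.UTF_8));
        outer.put("lib/inner.jar", zip(inner));

        // The outer archive is reported with the licenses of everything inside it.
        System.out.println("outer.jar: " + scanArchive(zip(outer)));
    }
}
```

In this sketch the GPL text sits only in the nested jar, yet it shows up in
the result for the outer archive, which is the behaviour proposed above.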

Tika can provide hashes of files.  I suggest we use those to track files
that have already been processed, so if an archive is found twice we report
the first one with its licenses and such, and the second as a duplicate of
the first.
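
A minimal sketch of that bookkeeping, assuming a SHA-256 digest computed with
java.security.MessageDigest as a stand-in for whatever hash Tika would supply.
The class HashDedupSketch and its register method are hypothetical names, not
part of Rat or Tika.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not part of Rat or Tika.
final class HashDedupSketch {

    // Maps a content hash to the name of the first resource seen with it.
    private final Map<String, String> firstSeen = new HashMap<>();

    // Hex-encoded SHA-256 of the content (stand-in for a Tika-provided hash).
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("SHA-256 not available", ex);
        }
    }

    // Returns null the first time this content is seen; otherwise returns the
    // name of the earlier resource that this one duplicates.
    String register(String name, byte[] content) {
        return firstSeen.putIfAbsent(sha256Hex(content), name);
    }

    public static void main(String[] args) {
        HashDedupSketch seen = new HashDedupSketch();
        byte[] jarBytes = "pretend jar content".getBytes(StandardCharsets.UTF_8);

        String first = seen.register("lib/a.jar", jarBytes);
        String second = seen.register("other/copy-of-a.jar", jarBytes);

        System.out.println("lib/a.jar duplicate of: " + first);
        System.out.println("other/copy-of-a.jar duplicate of: " + second);
    }
}
```

The first registration would be reported in full with its licenses; the second
only needs a pointer back to the first.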

I think we should add the hashes to the XML report as properties of the
resource element describing the file.  The hashes can be useful in
exploring SBOM entries and similar.

Thoughts?

Claude


[PR] Bump commons-cli:commons-cli from 1.6.0 to 1.7.0 [creadur-rat]

2024-04-18 Thread via GitHub


dependabot[bot] opened a new pull request, #241:
URL: https://github.com/apache/creadur-rat/pull/241

   Bumps commons-cli:commons-cli from 1.6.0 to 1.7.0.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@creadur.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (RAT-20) [GOOGLE-1] Detection of binaries should be smarter

2024-04-18 Thread Claude Warren (Jira)


 [ 
https://issues.apache.org/jira/browse/RAT-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Claude Warren reassigned RAT-20:


Assignee: Claude Warren

> [GOOGLE-1] Detection of binaries should be smarter
> --
>
> Key: RAT-20
> URL: https://issues.apache.org/jira/browse/RAT-20
> Project: Apache Rat
>  Issue Type: Improvement
>  Components: mime-meta-data
>Reporter: Robert Burrell Donkin
>Assignee: Claude Warren
>Priority: Major
>
> 
> Right now we use a heuristic to guess if the file is an executable based on
> the filename.  This should get smarter, perhaps sniffing the content of the
> file or something.
> 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (RAT-20) [GOOGLE-1] Detection of binaries should be smarter

2024-04-18 Thread Claude Warren (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-20?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838755#comment-17838755
 ] 

Claude Warren commented on RAT-20:
--

This ticket is resolved with Tika changes.



[jira] [Commented] (RAT-54) MIME Detection Using Tika

2024-04-18 Thread Claude Warren (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838753#comment-17838753
 ] 

Claude Warren commented on RAT-54:
--

The Tika changes apply Tika-based MIME detection.

> MIME Detection Using Tika
> -
>
> Key: RAT-54
> URL: https://issues.apache.org/jira/browse/RAT-54
> Project: Apache Rat
>  Issue Type: New Feature
>Affects Versions: 0.7
>Reporter: Robert Burrell Donkin
>Assignee: Claude Warren
>Priority: Major
>
> Tika provides sophisticated and comprehensive MIME detection. Add support for 
> a Tika based implementation.





[jira] [Resolved] (RAT-53) Factor Out Mime Detection

2024-04-18 Thread Claude Warren (Jira)


 [ 
https://issues.apache.org/jira/browse/RAT-53?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Claude Warren resolved RAT-53.
--
Resolution: Won't Fix

Approach is changing to use Tika for parsing and MIME identification.

> Factor Out Mime Detection
> -
>
> Key: RAT-53
> URL: https://issues.apache.org/jira/browse/RAT-53
> Project: Apache Rat
>  Issue Type: Improvement
>  Components: mime-meta-data
>Affects Versions: 0.7
>Reporter: Robert Burrell Donkin
>Priority: Major
>
> Factor out MIME detection to allow multiple implementations.





[jira] [Assigned] (RAT-54) MIME Detection Using Tika

2024-04-18 Thread Claude Warren (Jira)


 [ 
https://issues.apache.org/jira/browse/RAT-54?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Claude Warren reassigned RAT-54:


Assignee: Claude Warren



[jira] [Commented] (RAT-147) binary guesser design improvement

2024-04-18 Thread Claude Warren (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838720#comment-17838720
 ] 

Claude Warren commented on RAT-147:
---

The Tika parser correctly identifies these as Text files, and correctly locates 
the lines within.

Tests added to.

> binary guesser design improvement
> -
>
> Key: RAT-147
> URL: https://issues.apache.org/jira/browse/RAT-147
> Project: Apache Rat
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Marshall Schor
>Assignee: Claude Warren
>Priority: Minor
> Attachments: unix-newlines.txt.bin, windows-newlines.txt.bin
>
>
> A release manager cut a release; RAT was run, all was OK.  Another user tried 
> building from source / tag, and RAT complained of 2 files missing headers.  
> This was traced to the "binary guesser" which read the 1st 200 bytes of a 
> file and "guessed" if it was binary.  The file in question had a UTF-8 
> byte-order mark at the beginning, and was, in fact after that, plain ASCII.  
> The reason for 2 different results: the release manager's OS had a default 
> file encoding set to US-ASCII (as determined by running a small Java program 
> that prints out the value of System.getProperty("file.encoding")).  This encoding 
> is for 7-bit ASCII, so the guesser when decoding this gets a malformed 
> exception on the 3 bytes at the beginning of the file.  This causes the 
> guesser to conclude this is a "binary" file which doesn't need to be 
> RAT-checked.  The other user was on a Windows 7 machine, which has the 
> file.encoding defaulting to Cp1252 - which does have code points defined for 
> the first 3 bytes, and therefore doesn't throw any exception.  This makes the 
> guesser guess that  this isn't a binary file, and it checks the file and 
> reports a missing header (the file is test data...).
> Workaround - add the file to the explicit excludes.
> Potential problem - on a machine with default encoding US-ASCII, RAT will 
> improperly skip checking files which perhaps should have headers, if they 
> have a UTF-8 byte-order mark.
> Potential problem #2 - RAT is dependent on the default file encoding setting 
> for part of its behavior, causing differences in what it checks.
> I'm not sure what a good solution would be here.  It might range from 
> eliminating the binary "guesser" that looks at the first 200 bytes of a file, 
> to forcing UTF-8 as the charset to use.
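
The encoding dependence described in the report can be reproduced with a
strict decoder.  The sketch below is not Rat's binary guesser; it only shows,
under the assumption that a decoding failure is read as "binary", why a file
starting with a UTF-8 byte-order mark is treated differently under US-ASCII
and under windows-1252 (Cp1252).  The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; this is not the code of Rat's binary guesser.
final class BomDecodeSketch {

    // True if the bytes decode without error when malformed or unmappable
    // input is reported instead of silently replaced.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException ex) {
            return false;
        }
    }

    // Encoding-independent check for the UTF-8 byte-order mark EF BB BF.
    static boolean startsWithUtf8Bom(byte[] bytes) {
        return bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) {
        // UTF-8 BOM followed by plain ASCII text.
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i', '\n'};

        System.out.println("US-ASCII clean:     "
                + decodesCleanly(sample, StandardCharsets.US_ASCII));
        System.out.println("windows-1252 clean: "
                + decodesCleanly(sample, Charset.forName("windows-1252")));
        System.out.println("UTF-8 clean:        "
                + decodesCleanly(sample, StandardCharsets.UTF_8));
        System.out.println("has UTF-8 BOM:      " + startsWithUtf8Bom(sample));
    }
}
```

Decoding with a fixed charset, or checking for the byte-order mark directly,
gives the same answer on every machine, which is the direction the reporter
suggests at the end of the description.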





[jira] [Comment Edited] (RAT-211) Generated rat-output.xml must be well-formed, even if BinaryGuesser fails

2024-04-18 Thread Claude Warren (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838711#comment-17838711
 ] 

Claude Warren edited comment on RAT-211 at 4/18/24 4:02 PM:


When Tika is used the files report as
{noformat}
{noformat}
and
{noformat}
{noformat}
 

Which I believe is correct and keeps us from dumping the content.

 

Test cases have been added to verify.


was (Author: claudenw):
When Tika is used the files report as
{noformat}
{noformat}
and
{noformat}
{noformat}
 

Which I believe is correct and keeps us from dumping the content.

> Generated rat-output.xml must be well-formed, even if BinaryGuesser fails
> -
>
> Key: RAT-211
> URL: https://issues.apache.org/jira/browse/RAT-211
> Project: Apache Rat
>  Issue Type: Bug
>Reporter: Konstantin Kolinko
>Assignee: Claude Warren
>Priority: Major
> Attachments: rat-output.xml
>
>
> This issue was originally reported by Infrastructure team while running RAT 
> over Apache Tomcat source code, see thread
> "Files to exclude from buildbot rat tests" (started 2016-02-15) at dev "at" 
> tomcat.apache.org mailing list. (1)
> The issue:
> ===
> 1. Buildbot at ASF is configured to run RAT tool over tomcat-trunk, tomcat-8, 
> tomcat-7 source code.
> 2. Tomcat has \*.bmp, \*.dia files in its source code (images used by Windows 
> installer, diagrams in documentation) that RAT failed to recognize as binary.
> 3. RAT generated rat-output.xml file that included header-sample fragments of 
> those *.bmp and *.dia files. Those fragments are actually binary garbage.  
> The result is that a broken XML file was generated.
> 4. XSLT transformation from rat-output.xml into rat-output.html failed.
> I have not seen the actual error printed by XSLT processor, but I confirmed 
> that the file is broken by downloading rat-output.xml and opening it in 
> Firefox. Firefox reported a syntax error.
> Workaround:
> ===
> rat-excludes.txt file in Tomcat source code was updated to exclude
> \*\*/\*.bmp
> \*\*/\*.dia
> References:
> ===
> 1. "Files to exclude from buildbot rat tests" (started 2016-02-15) at dev 
> "at" tomcat.apache.org mailing list.
> http://markmail.org/message/rhrm54ch5omjalt4
> 2. Apache Tomcat links to Buildbot results:
> http://tomcat.apache.org/ci.html#Buildbot
> 3. Apache Tomcat source code
> http://tomcat.apache.org/svn.html
> Notes:
> - RAT excludes files in Tomcat source code are at
> res/rat/rat-excludes.txt
> - I know that Buildbot uses Ant to run RAT. The Ant project file for that is 
> not in Tomcat sources, but in Infrastructure configuration (I do not have a 
> link). It can be seen in "shell_5 RAT Report Complete" step during build run. 
> E.g. here:
> https://ci.apache.org/builders/tomcat-trunk/builds/1061
> - I do not know what version of RAT is used by that build slave on Buildbot.





[jira] [Commented] (RAT-211) Generated rat-output.xml must be well-formed, even if BinaryGuesser fails

2024-04-18 Thread Claude Warren (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838711#comment-17838711
 ] 

Claude Warren commented on RAT-211:
---

When Tika is used the files report as
{noformat}
{noformat}
and
{noformat}
{noformat}
 

Which I believe is correct and keeps us from dumping the content.



[jira] [Commented] (RAT-211) Generated rat-output.xml must be well-formed, even if BinaryGuesser fails

2024-04-18 Thread Konstantin Kolinko (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838538#comment-17838538
 ] 

Konstantin Kolinko commented on RAT-211:


Thank you for looking at this old issue.

The thread "Files to exclude from buildbot rat tests" (started 2016-02-15) at 
dev "at" tomcat.apache.org mailing list can be found here:
[https://lists.apache.org/thread/lt04vmk5xh6kn420k9cnln8pbn230pzo]

Versions of Apache Tomcat released in year 2016 are listed here:
[https://tomcat.apache.org/oldnews-2016.html]
e.g. Tomcat 7.0.68 Released 2016-02-16

The source code has been moved to an "archive" directory at svn since we 
migrated to git and for version 7.0.68 can be found here:
[https://svn.apache.org/repos/asf/tomcat/archive/tc7.0.x/tags/TOMCAT_7_0_68/]

See there the following files:

* res/side_left.bmp
* webapps/docs/tribes/leader-election-message-arrives.dia

[https://svn.apache.org/repos/asf/tomcat/archive/tc7.0.x/tags/TOMCAT_7_0_68/res/]
[https://svn.apache.org/repos/asf/tomcat/archive/tc7.0.x/tags/TOMCAT_7_0_68/webapps/docs/tribes/]

---

At build time those files are copied as is to output/dist/src when building a 
source archive, and that is where RAT found them. The copying is performed by 
target name="dist-source" in build.xml.

The rat-excludes file is at
[https://svn.apache.org/repos/asf/tomcat/archive/tc7.0.x/tags/TOMCAT_7_0_68/res/rat/]



Re: [PR] Bump commons-cli:commons-cli from 1.6.0 to 1.7.0 [creadur-whisker]

2024-04-18 Thread via GitHub


ottlinger merged PR #140:
URL: https://github.com/apache/creadur-whisker/pull/140





[jira] [Commented] (RAT-2) RAT reports should be able to skip certain file types or contents:

2024-04-18 Thread Claude Warren (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838516#comment-17838516
 ] 

Claude Warren commented on RAT-2:
-

Significant progress has been made in this area.  Please check the 
include/exclude flags and see if they, along with the ability to ignore source 
code control files, meet the requirement.

> RAT reports should be able to skip certain file types or contents:
> --
>
> Key: RAT-2
> URL: https://issues.apache.org/jira/browse/RAT-2
> Project: Apache Rat
>  Issue Type: Improvement
>  Components: core engine
>Reporter: Sebb
>Assignee: Claude Warren
>Priority: Major
>
> RAT reports should be able to skip certain file types or contents:
> MANIFEST files
> *.css where the contents is a single "@include" line
> CHANGES file
> It would be useful if exceptions could be configured on a per-project basis





[jira] [Assigned] (RAT-147) binary guesser design improvement

2024-04-18 Thread Claude Warren (Jira)


 [ 
https://issues.apache.org/jira/browse/RAT-147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Claude Warren reassigned RAT-147:
-

Assignee: Claude Warren



[jira] [Commented] (RAT-150) RAT should use Apache Tika to simply guess ignored [application/X] file types and focus on the [text/Y] family as a sensible default

2024-04-18 Thread Claude Warren (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838513#comment-17838513
 ] 

Claude Warren commented on RAT-150:
---

I am working on a patch to use Tika to discover the file type and map that to 
our file types.  It seems that Tika will give us the textual content of the 
file if we ask for it.  However, by default it seems to strip comments from XML, 
so I will have to figure out how to ask it for comments too.

> RAT should use Apache Tika to simply guess ignored [application/X] file types 
> and focus on the [text/Y] family as a sensible default
> 
>
> Key: RAT-150
> URL: https://issues.apache.org/jira/browse/RAT-150
> Project: Apache Rat
>  Issue Type: New Feature
>  Components: mime-meta-data, scan
>Affects Versions: 0.8
>Reporter: Chris A. Mattmann
>Assignee: Claude Warren
>Priority: Major
>
> RAT could use Apache Tika to automatically guess file types, obviating the 
> need to specify an explicit white list or black list.





[jira] [Assigned] (RAT-2) RAT reports should be able to skip certain file types or contents:

2024-04-18 Thread Claude Warren (Jira)


 [ 
https://issues.apache.org/jira/browse/RAT-2?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Claude Warren reassigned RAT-2:
---

Assignee: Claude Warren



[jira] [Assigned] (RAT-150) RAT should use Apache Tika to simply guess ignored [application/X] file types and focus on the [text/Y] family as a sensible default

2024-04-18 Thread Claude Warren (Jira)


 [ 
https://issues.apache.org/jira/browse/RAT-150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Claude Warren reassigned RAT-150:
-

Assignee: Claude Warren



[jira] [Commented] (RAT-301) Rat check file identification error,java files with Chinese characters are recognized as binary files

2024-04-18 Thread Claude Warren (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838508#comment-17838508
 ] 

Claude Warren commented on RAT-301:
---

[~casion]  I am working on a test to add Tika to replace the file guessers in 
the current Rat.  Do you have an example of a file that causes the problem, and 
can you attach it to this ticket?

> Rat check file identification error,java files with Chinese characters are 
> recognized as binary files
> -
>
> Key: RAT-301
> URL: https://issues.apache.org/jira/browse/RAT-301
> Project: Apache Rat
>  Issue Type: Bug
>Affects Versions: 0.13
> Environment: Window  
> Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 
> 2015-11-11T00:41:47+08:00)
>Reporter: Chen Xia
>Assignee: Claude Warren
>Priority: Major
>
> {code:java}
> // code placeholder
> 
> 
> org.apache.rat
> apache-rat-plugin
> 0.13
> 
> 
> rat-validate
> validate
> 
> check
> 
> 
> 
> 
> 
> **/*.versionsBackup
> **/.idea/
> **/*.iml
> **/*.txt
> **/*.json
> web/.editorconfig
> web/.env
> web/.eslintignore
> web/.jshintrc
> web/public/favicon.ico
> web/dist/**
> web/node_modules/**
> web/apache-linkis-*-web-bin.tar.gz
> **/*.md
> .git/
> .gitignore
> **/.settings/*
> **/.classpath
> **/.project
> **/target/**
> **/out/**
> **/*.log
> CONTRIBUTING.md
> CONTRIBUTING_CN.md
> DISCLAIMER
> DISCLAIMER
> README.md
> **/META-INF/**
> .github/**
> compiler/**
> **/generated/**
> 
> 
>  {code}
> This is the result of {{mvn apache-rat:check}}
> {code:java}
> Summary
> ---
> Generated at: 2022-05-06T09:56:39+08:00
> Notes: 0
> Binaries: 1
> Archives: 0
> Standards: 13
> Apache Licensed: 13
> Generated Documents: 0
> JavaDocs are generated, thus a license header is optional.
> Generated files do not require license headers.
> 0 Unknown Licenses
> *
>   Files with Apache License headers will be marked AL
>   Binary files (which do not require any license headers) will be marked B
>   B 
> D:/DataSphere/linkis_svn/1.1.1-RC1/apache-linkis-1.1.1-incubating-src/apache-linkis-1.1.1-incubating-src/linkis-public-enhancements/linkis-publicservice/linkis-udf/linkis-udf-common/src/main/java/org/apache/linkis/udf/entity/UDFVersion.java
>   AL
> D:/DataSphere/linkis_svn/1.1.1-RC1/apache-linkis-1.1.1-incubating-src/apache-linkis-1.1.1-incubating-src/linkis-public-enhancements/linkis-publicservice/linkis-udf/linkis-udf-common/src/main/java/org/apache/linkis/udf/excepiton/UDFException.java
>   
> * {code}
> UDFVersion.java is recognized as a binary file
> source code: https://github.com/casionone/incubator-linkis/tree/dev-1.1.1-rat





[jira] [Commented] (RAT-110) Add a maven configuration option to define a target license in order to mark a project as compliant

2024-04-18 Thread Claude Warren (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838507#comment-17838507
 ] 

Claude Warren commented on RAT-110:
---

[~pottlinger] I believe that this may be fixed in 0.16.1 where you can create a 
configuration file that specifies what license families to accept or reject.

Run the command-line version and pass --help for a list of command-line 
options.  I believe that the license inclusion/exclusion is implemented in the 
Maven plugin as well.

[~kwin] I believe your suggestion is also implemented in 0.16.1 where you can 
specify no default licenses.

> Add a maven configuration option to define a target license in order to mark 
> a project as compliant
> ---
>
> Key: RAT-110
> URL: https://issues.apache.org/jira/browse/RAT-110
> Project: Apache Rat
>  Issue Type: Improvement
>Affects Versions: 0.9
>Reporter: Philipp Ottlinger
>Assignee: Claude Warren
>Priority: Major
>
> Currently rat searches for ASLed files. In order to broaden the usage of this 
> tool I'd like to add a new maven configuration option that allows the 
> definition of a target license.
> This license has to exist in RAT and changes the output of the rat report but 
> does not change the default behaviour for backwards compatibility.
> =OLD REPORT=
> *
> Summary
> ---
> Notes: 1
> Binaries: 187
> Archives: 0
> Standards: 149
> Apache Licensed: 2
> 
> *
>   Files with Apache License headers will be marked AL
> .
> =NEW REPORT=
> If no configuration option is supplied above report stays the same, but may 
> change for different licenses (e.g. GPL / RAT-13).
> ==MVN CONFIGURATION==
> 
>   
>GNU General Public License, version 
> 3
>.
> will lead to the output:
> ==FLEXIBLE REPORT==
> *
> Summary
> ---
> Notes: 1
> Binaries: 187
> Archives: 0
> Standards: 149
> GPL3 Licensed: 2
> 
> *
>   Files with GNU General Public License, version 3 headers will be marked 
> GPL3.
> ...
> Since I already have a patch for RAT-13, this would be the next step towards 
> multi-license usage of Rat.





[jira] [Assigned] (RAT-301) Rat check file identification error,java files with Chinese characters are recognized as binary files

2024-04-18 Thread Claude Warren (Jira)


 [ 
https://issues.apache.org/jira/browse/RAT-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Claude Warren reassigned RAT-301:
-

Assignee: Claude Warren

> Rat check file identification error,java files with Chinese characters are 
> recognized as binary files
> -
>
> Key: RAT-301
> URL: https://issues.apache.org/jira/browse/RAT-301
> Project: Apache Rat
>  Issue Type: Bug
>Affects Versions: 0.13
> Environment: Window  
> Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 
> 2015-11-11T00:41:47+08:00)
>Reporter: Chen Xia
>Assignee: Claude Warren
>Priority: Major
>
> {code:java}
> // code placeholder
> 
> 
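> <groupId>org.apache.rat</groupId>
> <artifactId>apache-rat-plugin</artifactId>
> <version>0.13</version>
> <executions>
>   <execution>
>     <id>rat-validate</id>
>     <phase>validate</phase>
>     <goals>
>       <goal>check</goal>
>     </goals>
>   </execution>
> </executions>
> <configuration>
>   <excludes>
>     <exclude>**/*.versionsBackup</exclude>
>     <exclude>**/.idea/</exclude>
>     <exclude>**/*.iml</exclude>
>     <exclude>**/*.txt</exclude>
>     <exclude>**/*.json</exclude>
>     <exclude>web/.editorconfig</exclude>
>     <exclude>web/.env</exclude>
>     <exclude>web/.eslintignore</exclude>
>     <exclude>web/.jshintrc</exclude>
>     <exclude>web/public/favicon.ico</exclude>
>     <exclude>web/dist/**</exclude>
>     <exclude>web/node_modules/**</exclude>
>     <exclude>web/apache-linkis-*-web-bin.tar.gz</exclude>
>     <exclude>**/*.md</exclude>
>     <exclude>.git/</exclude>
>     <exclude>.gitignore</exclude>
>     <exclude>**/.settings/*</exclude>
>     <exclude>**/.classpath</exclude>
>     <exclude>**/.project</exclude>
>     <exclude>**/target/**</exclude>
>     <exclude>**/out/**</exclude>
>     <exclude>**/*.log</exclude>
>     <exclude>CONTRIBUTING.md</exclude>
>     <exclude>CONTRIBUTING_CN.md</exclude>
>     <exclude>DISCLAIMER</exclude>
>     <exclude>DISCLAIMER</exclude>
>     <exclude>README.md</exclude>
>     <exclude>**/META-INF/**</exclude>
>     <exclude>.github/**</exclude>
>     <exclude>compiler/**</exclude>
>     <exclude>**/generated/**</exclude>
>   </excludes>
> </configuration>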
>  {code}
> This is the result of {{mvn apache-rat:check}}
> {code:java}
> Summary
> ---
> Generated at: 2022-05-06T09:56:39+08:00
> Notes: 0
> Binaries: 1
> Archives: 0
> Standards: 13
> Apache Licensed: 13
> Generated Documents: 0
> JavaDocs are generated, thus a license header is optional.
> Generated files do not require license headers.
> 0 Unknown Licenses
> *
>   Files with Apache License headers will be marked AL
>   Binary files (which do not require any license headers) will be marked B
>   B 
> D:/DataSphere/linkis_svn/1.1.1-RC1/apache-linkis-1.1.1-incubating-src/apache-linkis-1.1.1-incubating-src/linkis-public-enhancements/linkis-publicservice/linkis-udf/linkis-udf-common/src/main/java/org/apache/linkis/udf/entity/UDFVersion.java
>   AL
> D:/DataSphere/linkis_svn/1.1.1-RC1/apache-linkis-1.1.1-incubating-src/apache-linkis-1.1.1-incubating-src/linkis-public-enhancements/linkis-publicservice/linkis-udf/linkis-udf-common/src/main/java/org/apache/linkis/udf/excepiton/UDFException.java
>   
> * {code}
> UDFVersion.java is recognized as a binary file
> source code: https://github.com/casionone/incubator-linkis/tree/dev-1.1.1-rat
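
One way a Java source file containing Chinese characters could be kept out of 
the "binary" bucket is to test whether its leading bytes are valid UTF-8. The 
sketch below is an illustration only and not what Rat does: the class name 
Utf8TextSniffer and the rule "no NUL bytes and well-formed UTF-8" are 
assumptions. Passing endOfInput=false to the decoder means a multi-byte 
character that is cut off at the end of the sampled bytes is not counted as an 
error.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Apache Rat.
final class Utf8TextSniffer {

    /** Returns true if the sampled bytes contain no NUL and are well-formed UTF-8. */
    static boolean looksLikeUtf8Text(byte[] head) {
        for (byte b : head) {
            if (b == 0) {
                return false; // NUL bytes are a strong hint of binary content
            }
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(head);
        // One byte never decodes to more than one char, so this is large enough.
        CharBuffer out = CharBuffer.allocate(head.length + 1);
        // endOfInput=false: a truncated trailing sequence yields UNDERFLOW, not an error.
        CoderResult result = decoder.decode(in, out, false);
        return !result.isError();
    }
}
```

Because the charset is named explicitly, the result does not depend on the 
platform default encoding.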





[jira] [Assigned] (RAT-110) Add a maven configuration option to define a target license in order to mark a project as compliant

2024-04-18 Thread Claude Warren (Jira)


 [ 
https://issues.apache.org/jira/browse/RAT-110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Claude Warren reassigned RAT-110:
-

Assignee: Claude Warren



[jira] [Commented] (RAT-211) Generated rat-output.xml must be well-formed, even if BinaryGuesser fails

2024-04-18 Thread Claude Warren (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838505#comment-17838505
 ] 

Claude Warren commented on RAT-211:
---

This should be fixed in 0.17, when we wrap the content of the sample element in 
a CDATA section.  

 

[~kkolinko] Do you have a sample document you can share so we can build a 
proper test?
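
For illustration, a minimal Java sketch of writing a header sample as CDATA. 
It is not Rat's implementation: the class XmlSampleEscaper and the choice to 
replace disallowed characters with U+FFFD are assumptions. Two facts from XML 
1.0 drive it: characters outside the Char production are not allowed even 
inside CDATA, and the sequence "]]>" ends a CDATA section.

```java
// Hypothetical helper, not part of Apache Rat.
final class XmlSampleEscaper {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Wraps the sample in CDATA after replacing characters XML 1.0 forbids. */
    static String toCdata(String sample) {
        StringBuilder clean = new StringBuilder(sample.length());
        // Assumption: disallowed characters (and unpaired surrogates) become U+FFFD.
        sample.codePoints()
                .forEach(cp -> clean.appendCodePoint(isXml10Char(cp) ? cp : 0xFFFD));
        // "]]>" would close the section early, so split it across two sections.
        String body = clean.toString().replace("]]>", "]]]]><![CDATA[>");
        return "<![CDATA[" + body + "]]>";
    }
}
```

A test built from a real \*.bmp or \*.dia sample would still be needed to 
confirm that the generated report parses.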

> Generated rat-output.xml must be well-formed, even if BinaryGuesser fails
> -
>
> Key: RAT-211
> URL: https://issues.apache.org/jira/browse/RAT-211
> Project: Apache Rat
>  Issue Type: Bug
>Reporter: Konstantin Kolinko
>Assignee: Claude Warren
>Priority: Major
> Attachments: rat-output.xml
>
>
> This issue was originally reported by the Infrastructure team while running 
> RAT over the Apache Tomcat source code; see the thread
> "Files to exclude from buildbot rat tests" (started 2016-02-15) on the dev 
> "at" tomcat.apache.org mailing list. (1)
> The issue:
> ===
> 1. Buildbot at ASF is configured to run the RAT tool over the tomcat-trunk, 
> tomcat-8 and tomcat-7 source code.
> 2. Tomcat has \*.bmp and \*.dia files in its source code (images used by the 
> Windows installer, diagrams in documentation) that RAT failed to recognize as 
> binary.
> 3. RAT generated a rat-output.xml file that included header-sample fragments 
> of those \*.bmp and \*.dia files. Those fragments are actually binary garbage.  
> The result is that a broken XML file was generated.
> 4. The XSLT transformation from rat-output.xml into rat-output.html failed.
> I have not seen the actual error printed by the XSLT processor, but I 
> confirmed that the file is broken by downloading rat-output.xml and opening 
> it in Firefox. Firefox reported a syntax error.
> Workaround:
> ===
> rat-excludes.txt file in Tomcat source code was updated to exclude
> \*\*/\*.bmp
> \*\*/\*.dia
> References:
> ===
> 1. "Files to exclude from buildbot rat tests" (started 2016-02-15) at dev 
> "at" tomcat.apache.org mailing list.
> http://markmail.org/message/rhrm54ch5omjalt4
> 2. Apache Tomcat links to Buildbot results:
> http://tomcat.apache.org/ci.html#Buildbot
> 3. Apache Tomcat source code
> http://tomcat.apache.org/svn.html
> Notes:
> - The RAT excludes file in the Tomcat source code is at
> res/rat/rat-excludes.txt
> - I know that Buildbot uses Ant to run RAT. The Ant project file for that is 
> not in Tomcat sources, but in Infrastructure configuration (I do not have a 
> link). It can be seen in "shell_5 RAT Report Complete" step during build run. 
> E.g. here:
> https://ci.apache.org/builders/tomcat-trunk/builds/1061
> - I do not know what version of RAT is used by that build slave on Buildbot.





[jira] [Updated] (RAT-147) binary guesser design improvement

2024-04-18 Thread Richard Eckart de Castilho (Jira)


 [ 
https://issues.apache.org/jira/browse/RAT-147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Eckart de Castilho updated RAT-147:
---
Attachment: unix-newlines.txt.bin
windows-newlines.txt.bin

> binary guesser design improvement
> -
>
> Key: RAT-147
> URL: https://issues.apache.org/jira/browse/RAT-147
> Project: Apache Rat
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Marshall Schor
>Priority: Minor
> Attachments: unix-newlines.txt.bin, windows-newlines.txt.bin
>
>
> A release manager cut a release; RAT was run, all was OK.  Another user tried 
> building from source / tag, and RAT complained of 2 files missing headers.  
> This was traced to the "binary guesser", which read the first 200 bytes of a 
> file and "guessed" whether it was binary.  The file in question had a UTF-8 
> byte-order mark at the beginning and was, after that, plain ASCII.  
> The reason for the 2 different results: the release manager's OS had a default 
> file encoding set to US-ASCII (as determined by running a small Java program 
> that prints out the value of System.getProperty("file.encoding")).  This 
> encoding is for 7-bit ASCII, so the guesser, when decoding the file, gets a 
> malformed-input exception on the 3 bytes at the beginning of the file.  This 
> causes the guesser to conclude that this is a "binary" file which doesn't need 
> to be RAT-checked.  The other user was on a Windows 7 machine, where 
> file.encoding defaults to Cp1252, which does have code points defined for 
> the first 3 bytes and therefore doesn't throw any exception.  This makes the 
> guesser guess that this isn't a binary file, so it checks the file and 
> reports a missing header (the file is test data...).
> Workaround - add the file to the explicit excludes.
> Potential problem - on a machine with default encoding US-ASCII, RAT will 
> improperly skip checking files which perhaps should have headers, if they 
> have a UTF-8 byte-order mark.
> Potential problem #2 - RAT is dependent on the default file encoding setting 
> for part of its behavior, causing differences in what it checks.
> I'm not sure what a good solution would be here.  It might range from 
> eliminating the binary "guesser" that looks at the first 200 bytes of a file, 
> to forcing UTF-8 as the charset to use.
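
The charset dependence described above can be reproduced with a few lines of 
Java. This is a minimal sketch, not the BinaryGuesser code: the class 
CharsetDependenceDemo and the sample bytes are made up for illustration. It 
decodes a UTF-8 byte-order mark followed by ASCII text with a strict decoder 
under three charsets (windows-1252 is the canonical Java name for Cp1252).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical demo class, not part of Apache Rat.
final class CharsetDependenceDemo {

    /** Returns true if the bytes decode without malformed or unmappable input. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** A UTF-8 byte-order mark (EF BB BF) followed by plain ASCII. */
    static byte[] bomPlusAscii() {
        return new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 't', 'e', 's', 't'};
    }

    public static void main(String[] args) {
        for (String name : new String[] {"US-ASCII", "windows-1252", "UTF-8"}) {
            System.out.println(name + ": "
                    + decodesCleanly(bomPlusAscii(), Charset.forName(name)));
        }
    }
}
```

Only US-ASCII rejects the three leading bytes, which matches the two different 
outcomes reported here. Naming the charset explicitly, rather than using the 
platform default, removes that difference.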





[jira] [Commented] (RAT-147) binary guesser design improvement

2024-04-18 Thread Richard Eckart de Castilho (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838500#comment-17838500
 ] 

Richard Eckart de Castilho commented on RAT-147:


[~claude] ha, I found something:

https://lists.apache.org/thread/bwdbppbnpw6zdqqktwtmflpry53hbsr8

So it looks like the file was {{unix-newlines.txt.bin}} which indeed has a BOM. 
I have attached it here and another similar file.

 [^unix-newlines.txt.bin]  [^windows-newlines.txt.bin] 
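
For building a test case from these attachments, a BOM check could look like 
the following minimal Java sketch. The class Utf8Bom and its methods are 
hypothetical and not part of Rat; the only fixed fact is that a UTF-8 
byte-order mark is the byte sequence EF BB BF.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Apache Rat.
final class Utf8Bom {

    /** Returns true if the bytes start with the UTF-8 byte-order mark EF BB BF. */
    static boolean hasUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    /** Returns the bytes without a leading UTF-8 BOM; the input is not modified. */
    static byte[] stripUtf8Bom(byte[] head) {
        return hasUtf8Bom(head) ? Arrays.copyOfRange(head, 3, head.length) : head;
    }
}
```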





[jira] [Comment Edited] (RAT-147) binary guesser design improvement

2024-04-18 Thread Richard Eckart de Castilho (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838498#comment-17838498
 ] 

Richard Eckart de Castilho edited comment on RAT-147 at 4/18/24 7:25 AM:
-

[~claude] I assume [~schor] won't respond, so I'll chime in here. The issue 
probably came up during an Apache UIMA release a few years ago. From reading 
the description, it looks like the build was ok on one machine (probably mine), 
but failed on [~schor]'s machine. I did a bit of digging in the history of the 
UIMA Java SDK and uimaFIT repos. In particular the latter had a release around 
the time this issue was filed. However, I didn't find a file for which an 
exclude was added at the time that would match the characteristics described in 
the issue... sorry.


was (Author: rec):
[~claude] I assume [~schor] won't respond, so I'll chime in here. The issue 
probably came up during an Apache UIMA release a few years ago. From reading 
the description, it looks like the build was ok on [~schor]'s machine, but 
failed on another one (possibly mine or maybe that was even before my time). I 
did a bit of digging in the history of the UIMA Java SDK and uimaFIT repos. In 
particular the latter had a release around the time this issue was filed. 
However, I didn't find a file for which an exclude was added at the time that 
would match the characteristics described in the issue... sorry.



[jira] [Commented] (RAT-147) binary guesser design improvement

2024-04-18 Thread Richard Eckart de Castilho (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838498#comment-17838498
 ] 

Richard Eckart de Castilho commented on RAT-147:


[~claude] I assume [~schor] won't respond, so I'll chime in here. The issue 
probably came up during an Apache UIMA release a few years ago. From reading 
the description, it looks like the build was ok on [~schor]'s machine, but 
failed on another one (possibly mine or maybe that was even before my time). I 
did a bit of digging in the history of the UIMA Java SDK and uimaFIT repos. In 
particular the latter had a release around the time this issue was filed. 
However, I didn't find a file for which an exclude was added at the time that 
would match the characteristics described in the issue... sorry.



[jira] [Assigned] (RAT-211) Generated rat-output.xml must be well-formed, even if BinaryGuesser fails

2024-04-18 Thread Claude Warren (Jira)


 [ 
https://issues.apache.org/jira/browse/RAT-211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Claude Warren reassigned RAT-211:
-

Assignee: Claude Warren



[jira] [Commented] (RAT-147) binary guesser design improvement

2024-04-18 Thread Claude Warren (Jira)


[ 
https://issues.apache.org/jira/browse/RAT-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838483#comment-17838483
 ] 

Claude Warren commented on RAT-147:
---

[~schor] Do you have examples of these types of files?  I know this was a 
decade ago but I am hoping we can get a test case built.
