Greetings,

The code base can already read archive files, but that ability is only
used to create a "walker" for archives passed in on the command line.  I
propose that we modify the processing of ARCHIVE type files to scan them
for licenses.

*Proposal:*
What I propose is that we extract each file in the ARCHIVE as a Document
and process it.  Any results from processing the Document will be added to
the archive's Document instance.  So any licenses found in files within the
archive are reported as licenses for the archive.

Processing of archives will exclude files listed in the filesToIgnore
ReportConfiguration property.
Processing of archives will NOT exclude directories listed in the
directoriesToIgnore ReportConfiguration property.
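
As a rough illustration of the idea (not the POC code), the sketch below
walks the entries of an archive, skips ignored file names, and collects the
licenses found into a single set for the archive.  It uses the JDK's
java.util.zip rather than commons-compress so that it runs on its own.
detectLicense, the fileIgnored predicate, and the choice to match the filter
against the simple file name are all assumptions made for the example, not
Rat APIs or settled behaviour.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ArchiveScanSketch {

    /** Hypothetical stand-in for Rat's license matching: a naive substring check. */
    static String detectLicense(String text) {
        if (text.contains("Apache License")) return "AL";
        if (text.contains("MIT License")) return "MIT";
        return "UNKNOWN";
    }

    /**
     * Reads every regular entry, skips names rejected by the file filter,
     * and collects the license ids found into one set for the archive.
     * Directory names are deliberately not filtered.
     */
    static Set<String> scan(InputStream archive, Predicate<String> fileIgnored) throws IOException {
        Set<String> found = new TreeSet<>();
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                String name = entry.getName();
                // Assumption: the ignore filter is applied to the simple file name.
                String simpleName = name.substring(name.lastIndexOf('/') + 1);
                if (fileIgnored.test(simpleName)) continue;
                // readAllBytes stops at the end of the current entry.
                String text = new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                found.add(detectLicense(text));
            }
        }
        return found;
    }

    /** Builds a small in-memory zip from name/content pairs so the sketch needs no fixtures. */
    static byte[] zipOf(String... nameContentPairs) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (int i = 0; i < nameContentPairs.length; i += 2) {
                zip.putNextEntry(new ZipEntry(nameContentPairs[i]));
                zip.write(nameContentPairs[i + 1].getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] jar = zipOf(
                "src/A.java", "// Licensed under the Apache License, Version 2.0",
                "src/B.js", "/* The MIT License */",
                "notes.txt", "no header here",
                "build/ignored.bin", "no header here either");
        Set<String> licenses = scan(new ByteArrayInputStream(jar), n -> n.endsWith(".bin"));
        System.out.println(licenses); // prints [AL, MIT, UNKNOWN]
    }
}
```

The UNKNOWN entry comes from notes.txt; whether it is reported would depend
on the processing level chosen.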

*Backward Compatibility:*
To keep this from breaking existing Rat execution I propose a new
configuration option and an enumeration of values for that option.

enum Processing { NOTIFICATION, PRESENCE, ABSENCE}

NOTIFICATION - The default.  The current level of reporting where we just
count the archives and list them in the report.  No internal processing of
the archive.

PRESENCE - Report the presence of any licenses found in the archive.  In
this case we ignore any UNKNOWN license entries and only report the
licenses found.

ABSENCE - like PRESENCE but adding the reporting of UNKNOWN licenses.

The command line option "--archive" will be used to set the property.  The
value of the property is not case sensitive.

*Examples:*
"--archive NOTIFICATION" will execute exactly as RAT does now.
"--archive Presence" will report any known licenses found in the archive.
"--archive absence" will report presence of any licenses found as well as
detection of files without licenses.
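
To make the three levels concrete, here is a minimal sketch of how the
option value might be parsed and applied.  The Processing enum follows the
definition above; the parse and reported helpers and the use of the string
"UNKNOWN" as a license id are assumptions for illustration only, not the
POC's actual code.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class ArchiveProcessingSketch {

    enum Processing {
        NOTIFICATION, PRESENCE, ABSENCE;

        /** Case-insensitive lookup, so "Presence" and "absence" are both accepted. */
        static Processing parse(String value) {
            return valueOf(value.trim().toUpperCase(Locale.ROOT));
        }
    }

    /** Returns the license ids that would be reported for an archive at the given level. */
    static List<String> reported(Processing level, List<String> foundInArchive) {
        switch (level) {
            case NOTIFICATION:
                // Archive is only counted and listed; nothing inside it is reported.
                return List.of();
            case PRESENCE:
                // Report known licenses only; UNKNOWN entries are dropped.
                return foundInArchive.stream()
                        .filter(id -> !"UNKNOWN".equals(id))
                        .collect(Collectors.toList());
            default: // ABSENCE: report everything, including UNKNOWN
                return foundInArchive;
        }
    }

    public static void main(String[] args) {
        List<String> found = List.of("AL", "MIT", "UNKNOWN");
        for (String arg : new String[] {"NOTIFICATION", "Presence", "absence"}) {
            System.out.println(arg + " -> " + reported(Processing.parse(arg), found));
        }
    }
}
```

With this sketch an unrecognised value makes valueOf throw
IllegalArgumentException; how the real option should handle bad input is
left open.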

*XML output changes:*
Currently archives are listed in the XML output as:

<resource name='src/test/resources/elements/dummy.jar' type='ARCHIVE'/>

This proposal would, in cases where licenses are discovered, add the
license entries as with STANDARD resource types.  For example:

<resource name='src/test/resources/elements/dummy.jar' type='ARCHIVE'>
  <license approval="true" family="MIT  " id="MIT" name="The MIT License"/>
  <license approval="true" family="AL   " id="AL" name="Apache License Version 2.0"/>
</resource>

*POC:*
I have a POC that implements minor changes to the Tika-based code base.
The changes are:
modified:   apache-rat-core/src/main/java/org/apache/rat/Report.java
modified:   apache-rat-core/src/main/java/org/apache/rat/analysis/DefaultAnalyserFactory.java
modified:   apache-rat-core/src/main/java/org/apache/rat/report/xml/XmlReportFactory.java
modified:   apache-rat-core/src/main/java/org/apache/rat/walker/ArchiveWalker.java
modified:   apache-rat-core/src/test/java/org/apache/rat/analysis/DefaultAnalyserFactoryTest.java

The changes include updating the ArchiveWalker code to use more current
commons-compress capabilities for archive type detection and reading.

Other changes are to support additional method arguments.

Thoughts?
Claude

-- 
LinkedIn: http://www.linkedin.com/in/claudewarren
