Greetings,

The code base has the ability to read archive files, but that ability is currently only used to create a "walker" that reads archives passed in on the command line. I propose that we modify the processing of ARCHIVE type files to scan them for licenses.

*Proposal:*

I propose that we extract each file in the ARCHIVE as a Document and process it. Any results from processing that Document will be added to the archive's Document instance, so any licenses found in files within the archive are reported as licenses for the archive.

Processing of archives will exclude files listed in the filesToIgnore ReportConfiguration property. Processing of archives will NOT exclude directories listed in the directoriesToIgnore ReportConfiguration property.

*Backward Compatibility:*

To keep this from breaking existing Rat execution, I propose a new configuration option and an enumeration of values for that option:

    enum Processing { NOTIFICATION, PRESENCE, ABSENCE }

NOTIFICATION - The default. The current level of reporting, where we just count the archives and list them in the report. No internal processing of the archive.
PRESENCE - Report the presence of any licenses found in the archive. In this case we ignore any UNKNOWN license entries and only report the licenses found.
ABSENCE - Like PRESENCE, but also reports UNKNOWN licenses.

The command line option "--archive" will be used to set the property. The value of the property is not case sensitive.

*Examples:*

"--archive NOTIFICATION" will execute exactly as RAT does now.
"--archive Presence" will report any known licenses found in the archive.
"--archive absence" will report the presence of any licenses found, as well as files detected without licenses.

*XML output changes:*

Currently archives are listed in the XML output as:

    <resource name='src/test/resources/elements/dummy.jar' type='ARCHIVE'/>

This proposal would, in cases where licenses are discovered, add the license entries as with STANDARD resource types.
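
As a rough illustration of how the three processing levels could decide what gets attached to an archive's entry, here is a minimal Java sketch. The class name ArchiveProcessingSketch, the parse and licensesToReport helpers, and the use of plain strings for license ids are all hypothetical and are not taken from the Rat code base or the POC. Collapsing duplicate ids is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch only; none of these names come from the Rat code base.
class ArchiveProcessingSketch {

    /** The three levels named in the proposal. */
    enum Processing {
        NOTIFICATION, PRESENCE, ABSENCE;

        /** Parses the option value without regard to case, e.g. "Presence". */
        static Processing parse(String value) {
            return valueOf(value.trim().toUpperCase(Locale.ROOT));
        }
    }

    /** Placeholder id for a file with no recognised license (assumed). */
    static final String UNKNOWN = "UNKNOWN";

    /**
     * Returns the license ids to report for the archive, given the ids
     * detected in the files inside it.
     */
    static List<String> licensesToReport(Processing mode, List<String> found) {
        List<String> result = new ArrayList<>();
        if (mode == Processing.NOTIFICATION) {
            return result; // archive is only listed; its contents are not reported
        }
        for (String id : found) {
            boolean unknown = UNKNOWN.equals(id);
            // PRESENCE drops unknown entries; ABSENCE keeps them.
            // Duplicates are collapsed (an assumption, not stated in the proposal).
            if ((!unknown || mode == Processing.ABSENCE) && !result.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> found = Arrays.asList("MIT", "UNKNOWN", "AL", "MIT");
        for (String arg : new String[] {"NOTIFICATION", "Presence", "absence"}) {
            System.out.println(arg + " -> " + licensesToReport(Processing.parse(arg), found));
        }
    }
}
```

Under these assumptions, an archive whose files carry the MIT and Apache licenses would, in PRESENCE mode, have both licenses reported on the archive's own resource entry.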

For example:

    <resource name='src/test/resources/elements/dummy.jar' type='ARCHIVE'>
      <license approval="true" family="MIT " id="MIT" name="The MIT License"/>
      <license approval="true" family="AL " id="AL" name="Apache License Version 2.0"/>
    </resource>

*POC:*

I have a POC that implements minor changes to the Tika-based code base. The changes are:

    modified: apache-rat-core/src/main/java/org/apache/rat/Report.java
    modified: apache-rat-core/src/main/java/org/apache/rat/analysis/DefaultAnalyserFactory.java
    modified: apache-rat-core/src/main/java/org/apache/rat/report/xml/XmlReportFactory.java
    modified: apache-rat-core/src/main/java/org/apache/rat/walker/ArchiveWalker.java
    modified: apache-rat-core/src/test/java/org/apache/rat/analysis/DefaultAnalyserFactoryTest.java

The changes include updating the ArchiveWalker code to use more current commons-compress capabilities for archive type detection and reading. The other changes support additional method arguments.

Thoughts?

Claude

--
LinkedIn: http://www.linkedin.com/in/claudewarren