*Current state*

We attempt to provide a default configuration based on ASF requirements.
We currently categorize documents into one of six categories: Generated, Unknown, Archive, Notice, Binary, Standard.

- Standard documents are scanned for the presence or absence of license headers.
- Archive documents may be scanned for the presence or absence of license headers.
- Notice files are determined by file name [1] and are excluded from processing.
- Generated files are determined by content scanning for key phrases and are excluded from processing.
- Unknown files are files that cannot otherwise be categorized.
- Binary files are noted but not processed.

We have filters to remove documents from processing based on file name or directory, but no good access to them via the command line. We have some filters (at least in the Maven plugin) that remove files based on their use by source code control systems. We use Tika to categorize the documents; Tika produces a MIME type for each document.

*Proposed Changes*

1. Remove the NOTICE category. I think the Notice concept is incorrect and should be handled by other means. Some notices (e.g. AUTHOR, AUTHOR.TXT, UPGRADE, UPGRADE.TXT) could simply be excluded by the file name exclusion process. Other notices (e.g. LICENSE, LICENSE.txt) should be scanned to determine which license is specified. The change to scanning LICENSE files would significantly help the Archive processor.

2. Deprecate the -e, --exclude, -E, --exclude-file, and --scan-hidden-directories command line arguments in favor of:
   - --exclude-file-literal to exclude a literal file name (e.g. "AUTHOR.TXT")
   - --exclude-file-wildcard to exclude files based on file wildcards (e.g. "AUTHOR.*")
   - --exclude-file-regex to exclude files based on regular expressions (e.g. "AUTHORS?(\.[Tt][Xx][Tt])?")
   - --exclude-dir-literal to exclude literal directory names
   - --exclude-dir-wildcard to exclude directories based on wildcards
   - --exclude-dir-regex to exclude directories based on regular expressions
   - --exclude-contents-literal to exclude files based on a literal match to text in the file
   - --exclude-contents-regex to exclude files based on a regular expression match to text in the file
   - --exclude-source to exclude files and directories based on the contents of a file. Multiple file formats could be accepted, but in general each entry carries a flag for the file/directory/contents trichotomy and one for the literal/wildcard/regex trichotomy. For example: file:literal:AUTHOR.TXT or <contents><literal>Generated by</literal></contents>
   - --no-default-exclude to remove any exclusions that are included by default

3. Add a count of all excluded files to the XML report, broken down by exclusion type (file, directory, contents) and match type (literal, wildcard, regex).

4. Remove the GENERATED category. It is actually an exclusion based on content and is handled by the exclusions above.

5. Add some processing for BINARY files. These include image and audio files that may carry licensing information in their metadata. Add a "--binary <ProcessingType>" command line argument, similar to "--archive <ProcessingType>" [2], to describe how binary files are handled, and use Tika's capabilities to extract the text and/or metadata for processing like ARCHIVE processing.

6. Add the MIME type to the Resource element in the XML output, as it can help with detailed reporting.

This will reduce the number of document types to four: UNKNOWN, BINARY, ARCHIVE, STANDARD. It will simplify the processing of documents while giving users the ability to fine-tune how files are processed.

Thoughts?
Claude

[1] https://github.com/apache/creadur-rat/blob/master/apache-rat-core/src/main/java/org/apache/rat/document/impl/guesser/NoteGuesser.java
[2] https://github.com/apache/creadur-rat/pull/246

--
LinkedIn: http://www.linkedin.com/in/claudewarren
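P.S. To make the literal/wildcard/regex trichotomy from item 2 concrete, here is a rough Java sketch of what the three matcher kinds could look like behind a common Predicate. None of these names exist in RAT today; the method names and class are mine, and the regex is the AUTHORS example from above.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical sketch only: unify the three --exclude-file-* match kinds
// behind Predicate<String> so the exclusion engine needs no special cases.
public class ExcludeSketch {

    // --exclude-file-literal: exact file name match
    static Predicate<String> literal(String name) {
        return fileName -> fileName.equals(name);
    }

    // --exclude-file-wildcard: glob match via java.nio ("AUTHOR.*" style)
    static Predicate<String> wildcard(String glob) {
        PathMatcher m = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return fileName -> m.matches(Path.of(fileName));
    }

    // --exclude-file-regex: full regular-expression match
    static Predicate<String> regex(String pattern) {
        Pattern p = Pattern.compile(pattern);
        return fileName -> p.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        Predicate<String> byRegex = regex("AUTHORS?(\\.[Tt][Xx][Tt])?");
        System.out.println(byRegex.test("AUTHORS.TXT")); // true
        System.out.println(byRegex.test("AUTHOR"));      // true
        System.out.println(byRegex.test("LICENSE"));     // false
    }
}
```

A directory trichotomy would look the same, just applied to path elements instead of file names.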
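For the --exclude-source flat-text format suggested in item 2 ("file:literal:AUTHOR.TXT"), parsing is trivial since each line is just the two trichotomy flags plus a value. A hypothetical sketch (nothing here exists in RAT):

```java
// Hypothetical parser for "kind:matchType:value" lines proposed for
// --exclude-source, e.g. "file:literal:AUTHOR.TXT".
public class ExcludeSourceSketch {

    record Exclusion(String kind, String matchType, String value) {
        static Exclusion parse(String line) {
            // limit=3 so the value itself may contain ':' characters
            String[] parts = line.split(":", 3);
            if (parts.length != 3) {
                throw new IllegalArgumentException(
                        "expected kind:matchType:value, got: " + line);
            }
            return new Exclusion(parts[0], parts[1], parts[2]);
        }
    }

    public static void main(String[] args) {
        Exclusion e = Exclusion.parse("file:literal:AUTHOR.TXT");
        System.out.println(e.kind() + " / " + e.matchType() + " / " + e.value());
        // prints: file / literal / AUTHOR.TXT
    }
}
```

The XML variant (<contents><literal>Generated by</literal></contents>) carries the same two flags as element names, so both formats could feed the same Exclusion record.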
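On item 6: since Tika already produces a MIME type for every document, emitting it on the Resource element should cost nothing extra. RAT would take the type from Tika; the JDK sketch below just illustrates the idea of attaching a type per resource using the standard library's extension-based lookup (the XML attribute name at the end is my invention, not the current report schema).

```java
import java.net.URLConnection;

// Illustration only: RAT uses Apache Tika for detection; this uses the JDK's
// simple extension-based table to show a MIME type being attached per file.
public class MimeSketch {

    static String mimeOf(String fileName) {
        String type = URLConnection.guessContentTypeFromName(fileName);
        // fall back to the generic binary type when the extension is unknown
        return type == null ? "application/octet-stream" : type;
    }

    public static void main(String[] args) {
        System.out.println(mimeOf("logo.png"));   // image/png
        System.out.println(mimeOf("README.txt")); // text/plain
    }
}
```

A Resource element in the report might then look something like (hypothetical attribute): <resource name="src/site/logo.png" mime-type="image/png" .../>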