*Current state*

We attempt to provide a default configuration that is based on ASF
requirements.

We currently categorize documents into one of six categories: Generated,
Unknown, Archive, Notice, Binary, Standard.

   - Standard documents get scanned for the presence or absence of license
   headers.
   - Archive documents may get scanned for the presence or absence of
   license headers.
   - Notice files are determined by file name [1] and are excluded from
   processing.
   - Generated files are determined by content scanning for key phrases and
   are excluded from processing.
   - Unknown files are files that cannot otherwise be categorized.
   - Binary files are noted but not processed.


We have filters to remove documents from processing based on filename or
directory, but no good access to them via the command line.

We have some filters (at least in the Maven plugin) that will remove files
based on their use in source code control systems.

We are using Tika to categorize the documents.  Tika produces a mime type
for each document.

*Proposed Changes*

   1. Remove the NOTICE category.  I think that the Notice concept is
   incorrect and should be handled by other means.  Some of the notices (e.g.
   AUTHOR, AUTHOR.TXT, UPGRADE, UPGRADE.TXT) could simply be excluded by the
   file name exclusion process.  Other notices (e.g. LICENSE, LICENSE.txt)
   should be scanned to determine what license is specified.  The change to
   scanning LICENSE files would significantly help the Archive processor.
   2. Deprecate the -e, --exclude, -E, --exclude-file,
   --scan-hidden-directories command line arguments in favor of:
      - --exclude-file-literal to exclude literal file names (e.g.
      "AUTHOR.TXT")
      - --exclude-file-wildcard to exclude files based on file wildcards
      (e.g. "AUTHOR.*")
      - --exclude-file-regex to exclude files based on regular expressions
      (e.g. "AUTHORS?(\.[Tt][Xx][Tt])?")
      - --exclude-dir-literal to exclude literal directory names
      - --exclude-dir-wildcard to exclude wildcard directory names
      - --exclude-dir-regex to exclude directories based on regular
      expressions.
      - --exclude-contents-literal to exclude files based on a literal
      match to text in the file.
      - --exclude-contents-regex to exclude files based on a regular
      expression match to text in the file.
      - --exclude-source to exclude files and directories based on the
      input of a file.  Multiple file formats could be accepted, but in
      general each entry has a flag for the file/directory/contents
      trichotomy and a flag for the literal/wildcard/regex trichotomy.
      For example: file:literal:AUTHOR.TXT or
      <contents><literal>Generated  by</literal></contents>
      - --no-default-exclude to remove any exclusions that are included by
      default.
   3. Add a count of all excluded files to the XML report.  This should
   include counts broken down by what the exclusion matched against (file,
   directory, contents) and by how it matched (literal, wildcard, regex).
   4. Remove the GENERATED category.  This is actually an exclusion based
   on content and is handled above.
   5. Add some processing to the BINARY files.  These files include image
   and audio files that may have licensing information in their metadata.  Add
   a "--binary <ProcessingType>" command line argument, similar to "--archive
   <ProcessingType>" [2], to describe how to handle binary files, and use the
   Tika capabilities to extract the text and/or metadata for processing like
   ARCHIVE processing.
   6. Add the mime type to the Resource element in the XML output as it can
   help in detailed reporting.
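
To make item 2 more concrete, here is a minimal sketch of how one entry in
the compact file:literal:AUTHOR.TXT form could be parsed and matched.  It is
only an illustration under my own assumptions: the class name ExclusionRule,
the Target and Kind enums, and the parse and matches methods are hypothetical
and do not exist in RAT today.  It also assumes that content rules match
anywhere in the text while file and directory rules match the whole name, and
it rejects contents:wildcard because no such option is proposed above.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical sketch of one exclusion entry; not part of the RAT code base. */
final class ExclusionRule {
    /** What the rule is matched against. */
    enum Target { FILE, DIR, CONTENTS }

    /** How the pattern is interpreted. */
    enum Kind { LITERAL, WILDCARD, REGEX }

    final Target target;
    final Kind kind;
    final String pattern;
    private final Predicate<String> matcher;

    private ExclusionRule(Target target, Kind kind, String pattern) {
        this.target = target;
        this.kind = kind;
        this.pattern = pattern;
        this.matcher = buildMatcher(target, kind, pattern);
    }

    /** Parses "target:kind:pattern", for example "file:literal:AUTHOR.TXT". */
    static ExclusionRule parse(String line) {
        // Limit the split to 3 parts so the pattern itself may contain ':'.
        String[] parts = line.split(":", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected target:kind:pattern but got: " + line);
        }
        Target target = Target.valueOf(parts[0].trim().toUpperCase(Locale.ROOT));
        Kind kind = Kind.valueOf(parts[1].trim().toUpperCase(Locale.ROOT));
        return new ExclusionRule(target, kind, parts[2]);
    }

    private static Predicate<String> buildMatcher(Target target, Kind kind, String pattern) {
        boolean contents = target == Target.CONTENTS;
        switch (kind) {
            case LITERAL: {
                if (contents) {
                    return text -> text.contains(pattern);
                }
                return pattern::equals;
            }
            case WILDCARD: {
                if (contents) {
                    throw new IllegalArgumentException("No wildcard matching for contents");
                }
                // Glob matching follows the default file system's rules.
                PathMatcher glob = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
                return name -> glob.matches(Paths.get(name));
            }
            case REGEX: {
                Pattern regex = Pattern.compile(pattern);
                if (contents) {
                    return text -> regex.matcher(text).find();
                }
                return name -> regex.matcher(name).matches();
            }
            default:
                throw new AssertionError(kind);
        }
    }

    /** Returns true if the given name (or text, for CONTENTS rules) is excluded. */
    boolean matches(String candidate) {
        return matcher.test(candidate);
    }

    public static void main(String[] args) {
        ExclusionRule rule = ExclusionRule.parse("file:wildcard:AUTHOR.*");
        System.out.println(rule.matches("AUTHOR.TXT"));
    }
}
```

Each of the proposed --exclude-* options would then just be a shorthand for
one target/kind pair, and --exclude-source would read one such entry per line.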

This will reduce the number of document types to four: UNKNOWN, BINARY,
ARCHIVE, STANDARD; and will simplify the processing of documents while
giving users the ability to fine-tune how files are processed.
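
For the counts in item 3, a rough sketch of the bookkeeping could look like
the following.  Again, ExclusionCounter and its methods are names I made up
for illustration, and how the numbers would be written into the XML report is
left open here.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical tally of excluded files; not part of the RAT code base. */
final class ExclusionCounter {
    enum Target { FILE, DIR, CONTENTS }
    enum Kind { LITERAL, WILDCARD, REGEX }

    private final Map<Target, Map<Kind, Integer>> counts = new EnumMap<>(Target.class);

    /** Records one file excluded by a rule of the given target and kind. */
    void record(Target target, Kind kind) {
        counts.computeIfAbsent(target, t -> new EnumMap<>(Kind.class))
              .merge(kind, 1, Integer::sum);
    }

    /** Number of files excluded by the given target/kind pair. */
    int count(Target target, Kind kind) {
        Map<Kind, Integer> byKind = counts.get(target);
        if (byKind == null) {
            return 0;
        }
        return byKind.getOrDefault(kind, 0);
    }

    /** Total number of excluded files across all pairs. */
    int total() {
        return counts.values().stream()
                .flatMap(byKind -> byKind.values().stream())
                .mapToInt(Integer::intValue)
                .sum();
    }
}
```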

Thoughts?
Claude

[1]
https://github.com/apache/creadur-rat/blob/master/apache-rat-core/src/main/java/org/apache/rat/document/impl/guesser/NoteGuesser.java
[2] https://github.com/apache/creadur-rat/pull/246
-- 
LinkedIn: http://www.linkedin.com/in/claudewarren
