Rat spends a lot of effort parsing textual documents, looking for headers and boilerplate text. There's an extension point (of sorts) for the searches that can be performed, provided by IHeaderMatcher[1].

This interface has a few TODOs in it. It's used by pushing the text in one line at a time, after some pre-processing. As the TODO indicates, this may not be the most elegant design.
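To illustrate the push-style usage, here is a rough, self-contained sketch. LineMatcher, PhraseMatcher and PushDriver are hypothetical names, not Rat classes; LineMatcher stands in for IHeaderMatcher with the Document parameter and RatHeaderAnalysisException left out so that the sketch compiles on its own.

```java
import java.util.List;

// Hypothetical stand-in for IHeaderMatcher; the Document parameter and
// RatHeaderAnalysisException are omitted to keep the sketch self-contained.
interface LineMatcher {
    void reset();
    boolean match(String line);
}

// Illustrative matcher: accumulates the pushed lines and looks for a phrase.
final class PhraseMatcher implements LineMatcher {
    private final String phrase;
    private final StringBuilder seen = new StringBuilder();

    PhraseMatcher(String phrase) {
        this.phrase = phrase;
    }

    @Override
    public void reset() {
        seen.setLength(0);
    }

    @Override
    public boolean match(String line) {
        // Join lines with a space so a phrase may span a line break.
        seen.append(line).append(' ');
        return seen.indexOf(phrase) >= 0;
    }
}

final class PushDriver {
    // Pushes lines in one at a time and stops at the first match.
    static boolean scan(LineMatcher matcher, List<String> lines) {
        matcher.reset();
        for (String line : lines) {
            if (matcher.match(line)) {
                return true;
            }
        }
        return false;
    }
}
```

Note that the matcher has to carry state between calls and do its own accumulation, which is the part of the design the TODO seems to be unhappy about.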

As an extension point, IHeaderMatcher has the advantage of flexibility. It would be possible to plug in radically different implementations. It turns out, though, that few clever new implementations have emerged. All the implementations seem to do is check for license headers.

One disadvantage of this arrangement is that it pushes some of the parsing outwards toward supposedly pluggable implementations. As a result, adding a new license means writing a partial parser.

I wonder whether it might be more intuitive (as well as opening potential for faster parsing) to use immutable domain objects for licenses and so on, making them data rather than processors.
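As a rough sketch of what "licenses as data" might look like (LicenseData and LicenseScanner are hypothetical names, and the marker-phrase matching is just an assumption for illustration, not how Rat works today):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical immutable value object: a license is described purely by data.
final class LicenseData {
    private final String id;
    private final List<String> markers;

    LicenseData(String id, List<String> markers) {
        this.id = id;
        // Defensive copy so later changes to the caller's list have no effect.
        this.markers = Collections.unmodifiableList(new ArrayList<>(markers));
    }

    String id() {
        return id;
    }

    List<String> markers() {
        return markers;
    }
}

// One shared scanner owns all of the parsing; licenses carry no logic.
final class LicenseScanner {
    private final List<LicenseData> known;

    LicenseScanner(List<LicenseData> known) {
        this.known = Collections.unmodifiableList(new ArrayList<>(known));
    }

    // Returns the id of the first license whose markers all occur in the
    // header text, or null if none does.
    String identify(String headerText) {
        String normalized = headerText.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        for (LicenseData license : known) {
            boolean allPresent = true;
            for (String marker : license.markers()) {
                if (!normalized.contains(marker.toLowerCase(Locale.ROOT))) {
                    allPresent = false;
                    break;
                }
            }
            if (allPresent) {
                return license.id();
            }
        }
        return null;
    }
}
```

With something along these lines, adding a license would mean adding a LicenseData instance rather than another parser, and the normalization would happen once per document instead of once per matcher.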

Opinions...? Alternatives...?

Robert

[1]
/**
* Resets this matches.
* Subsequent calls to {@link #match} will accumulate new text.
*/
public void reset();

/**
* Matches the text accumulated to licenses.
* TODO probably a poor design choice - hope to fix later
* @param subject TODO
* @param line next line of text, not null
* @return TODO
*/
public boolean match(Document subject, String line) throws RatHeaderAnalysisException;

http://svn.apache.org/viewvc/creadur/rat/trunk/apache-rat-core/src/main/java/org/apache/rat/analysis/IHeaderMatcher.java?revision=1396305&view=markup
