Re: Spam package redesign

Andrew Jaquith Wed, 07 Oct 2009 18:33:01 -0700

I just checked in the org.apache.wiki.content.inspect package. It is
very close to what we agreed on before. I incorporated the listener
strategy Janne suggested. An additional key change was the creation of
the "InspectionPlan" class where the number, order, and weighting for
the Inspector classes are organized. Thus, you could create a "spam"
InspectionPlan with a defined stack of Inspectors. I've provided one
for this purpose that is created as a singleton-per-wiki and
instantiated at startup based on jspwiki.properties.


All of this is well documented in the package I checked in. If you
want to see the package.html javadoc, it is here:

https://svn.apache.org/repos/asf/incubator/jspwiki/trunk/src/java/org/apache/wiki/content/inspect/package.html

There are a few caveats with this initial release. First, I haven't
yet written an Akismet Inspector, or ones for various Captcha
implementations. Both of these are coming next -- it should be
straightforward to refactor the existing code into them. Second, I
haven't yet reworked Comment.jsp or the FCK editor JSP to work with
the new inspection API. Also, I don't yet have default "weights" for
the inspectors in the jspwiki.properties template. All of these things
will come in future checkins.

But in general, I'm fairly pleased with how it turned out. It is very
extensible and flexible, but also very SIMPLE to understand. Also,
I've written some reasonably extensive unit tests that were a lot more
comprehensive for testing the various Inspectors than what we had for
the monolithic SpamFilter.

As always, comments are welcome.

Andrew

On Mon, Sep 28, 2009 at 11:24 AM, Andrew Jaquith
<[email protected]> wrote:
> There's some room for optimization in what you proposed, but your
> p-code is close to what I was thinking. in essence, it is the Strategy
> pattern that calls the inspector logic, combined with a listener for
> scoring changes (aka Visitor pattern) that interested callers would
> subscribe to. That works.
>
> After having originally suggested an "InspectionInterruptedException"
> so that inspectors could abort a running inspection chain, I now think
> it's better to simply return an object from the listener/visited
> method, rather than throw an exception. Exceptions could be perceived
> by an implementer as a little trickier to debug. I'd rather make the
> API simpler.
>
> FYI, the workflow package isn't totally suitable here. When you cited
> the Decision class, I suspect you actually meant the Outcome class,
> no?
>
> Ok. I've got what I need to do this properly. This should work pretty well.
>
> Andrew
>
> On Mon, Sep 28, 2009 at 3:11 AM, Janne Jalkanen
> <[email protected]> wrote:
>>
>> Yah, it's just that smaller numbers are usually easier to deal with (numbers
>> from 1-20).  We need more resolution at the low end, not at the high end.
>>  That's my preference for using floats.
>>
>> A simple way to do as you suggest would be to have a Score return value,
>> e.g.
>>
>> public Score inspect(...)
>> {
>>    // Detection score goes here
>>
>>    if( isSpam )
>>        return new Score( "mywonderfulspamdetectortest", 0.5f );
>>
>>    return Score.OK; // where score.OK is defined to be 0.0f.
>> }
>>
>> where the first String is a key and the float is a default value, which can
>> be overridden in a config file using the key. This would be pretty close to
>> the SpamAssassin model (and close to the java.util.Properties conceptually,
>> so it would be easy to adapt).
>>
>> (The reason why I'm proposing a default value is so that when you add a new
>> Inspector, you don't have to modify two files.)
>>
>> After each Inspector would be called, then the parent would call the
>> callback I proposed, and that would then determine what to do.  In
>> pseudocode:
>>
>> // JSPWiki internal code.
>> private void inspect( DecisionMaker decisionMaker, Change change, <some
>> other values> )
>> {
>>    ArrayList scores;
>>    for( Inspector inspector : m_inspectorChain )
>>    {
>>        Score s = inspector.inspect( change, <some values> );
>>
>>        scores.add( s );
>>
>>        if( decisionMaker.visit( scores ) == Decision.SPAM ) { ... }
>>    }
>> }
>>
>> We can then provide something like this
>>
>> class SpamFilter implements PageFilter, DecisionMaker
>> {
>>    ...
>>
>>    // This emulates the 2.8.x SpamFilter functionality.
>>    public Decision visit( List<Score> scores )
>>    {
>>        if( scores.last().getValue() > 0.0 ) return Decision.SPAM;
>>
>>        return Decision.CONTINUE;
>>    }
>> }
>>
>> (At any rate, it might make sense to reuse some classes from the workflow
>> package here; after all, this is a machine workflow in a sense, is it not?)
>>
>> (Another possible pattern would be to have the DecisionMaker throw an
>> exception. Don't quite know what's better.)
>>
>> /Janne
>>
>> On Sep 28, 2009, at 02:04 , Andrew Jaquith wrote:
>>
>>> It's only more complicated in the sense that I'd have to change some
>>> recently-written code. :)
>>>
>>> Both integer and floating point -- as you point out -- are really the
>>> same except for the choice of where to put the decimal point. I chose
>>> integer to preserve the same scoring strategy from the previous
>>> design. Basically: add stuff up, and if the total exceeds a theshold,
>>> you've got spam.
>>>
>>> I took a quick look at SpamAssassin. They do use floating point
>>> scores. I like the way they allow weights to be customized for each
>>> rule. That seems like it would be a good enhancement for us, too.
>>>
>>> So, it seems to me that if we go floating-point, we should make it
>>> possible to configure custom weights for each Inspector. That also
>>> implies that it would be better to simply have Inspector.inspect()
>>> simply return a "pass/fail/no result" result and have the parent
>>> Inspection object figure out what to add or subtract from the running
>>> total. This would mean that the Inspectors wouldn't be able to modify
>>> scores directly. And it would be closer to the SpamAssessin model, and
>>> would probably simpler to code for too.
>>>
>>> Reasonable?
>>>
>>> On Sun, Sep 27, 2009 at 4:12 PM, Janne Jalkanen
>>> <[email protected]> wrote:
>>>>
>>>> Heya!
>>>>
>>>> I don't quite understand - why would a floating point value be any more
>>>> complicated than an integer?
>>>>
>>>> Note that at the moment we mark a modification a spam if a single test
>>>> fails, which essentially means doing logical OR more than any sort of an
>>>> addition (=threshold is always 1).  The integer is there more or less as
>>>>
>>>> something-which-was-never-completed-and-I-am-almost-sure-is-the-wrong-design.
>>>> In fact, I did lament the fact that you can't say that "X is a bit more
>>>> efficient than Y" - with an integer-based design you essentially go for
>>>> fixed point arithmetics (test A gives you 100, test B gives you 150, etc,
>>>> where you just shifted decimal points to the right just for the heck of
>>>> it).
>>>>
>>>> If you take a look at SpamAssassin, it uses floats.
>>>>
>>>> I'd hesitate changing it later, since it will create an API contract.
>>>>
>>>> /Janne
>>>>
>>>> On Sep 27, 2009, at 22:34 , Andrew Jaquith wrote:
>>>>
>>>>> Janne --- good critique, and good suggestions. I've killed the
>>>>> limit-setting methods on the Inspection class itself in favor of a
>>>>> listener
>>>>> strategy. I adopteed your other suggestions too, except...
>>>>>
>>>>> ...at the moment I am inclined to leave the scoring scale as
>>>>> integer-based
>>>>> for now. I like it because it is simple, and resembles what we do now.
>>>>> We
>>>>> can change it later if need be.
>>>>>
>>>>> In answer to your first question: the inspector package does not help
>>>>> graduation, although it does help the beta that comes after the diploma.
>>>>> :)
>>>>>
>>>>> Andrew
>>>>>
>>>>> On Sep 27, 2009, at 4:37, Janne Jalkanen <[email protected]>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> My main concern: does this advance our graduation?
>>>>>>
>>>>>> Others inline.
>>>>>>
>>>>>> On 25 Sep 2009, at 20:54, Andrew Jaquith wrote:
>>>>>>
>>>>>>> The "inspect" package provides facilities that allow content to be
>>>>>>> inspected and scored based on various criteria such as whether a wiki
>>>>>>> page change contains spam. Content may be scored by initializing an
>>>>>>> Inspection object and then calling the {...@link Inspection#inspect()}
>>>>>>> method. The {...@code inspect} method, in turn, iterates through a
>>>>>>> variety of previously-configured {...@link Inspector} objects and calls
>>>>>>> the {...@link Inspector#inspect(String,String)} method for each one. 
>>>>>>> Each
>>>>>>> of the configured inspectors is free to perform whatever evaluations
>>>>>>> it wishes, and can increase, decrease or reset the "scores" for any
>>>>>>> score category, for example the {...@link Score#SPAM} category.
>>>>>>
>>>>>> Ok.
>>>>>>
>>>>>>> * Score objects supply instructions to the parent Inspection object to
>>>>>>> increment, decrement or reset the score for a particular category.
>>>>>>> Each Score object is constructed with a category (for example,
>>>>>>> Score.SPAM), an integer indicating how much to change the score, and
>>>>>>> an optional String message that provides context for the change. For
>>>>>>> example, a Score that increments the spam score by 1 could be
>>>>>>> constructed by new Score( Score.SPAM, 1, "Bot detected." ). Negative
>>>>>>> numbers can be supplied also to decrease the score. For convenience,
>>>>>>> {...@link Score#INCREMENT_SCORE} means "add 1", {...@link
>>>>>>> Score#DECREMENT_SCORE} means "subtract 1", and {...@link Score#RESET}
>>>>>>> means "reset to zero."
>>>>>>
>>>>>> Smells of overdesign to me.  Creating a new operation and methodology
>>>>>> to
>>>>>> support addition sounds like a bad idea. Not to mention the fact that
>>>>>> you
>>>>>> might actually want to use fractions.
>>>>>>
>>>>>> Something like inspection.changeScore( Score.SPAM, 0.5f ) sounds more
>>>>>> appealing to me.
>>>>>>
>>>>>>> public String preSave( WikiContext context, String content ) throws
>>>>>>> RedirectException
>>>>>>> {
>>>>>>> Change change = getChange( context, content );
>>>>>>>
>>>>>>> // Run the Inspection
>>>>>>> Inspection inspection = new Inspection( context, m_config,
>>>>>>> m_inspectors
>>>>>>> );
>>>>>>> inspection.addLimit( Score.SPAM, m_scoreLimit );
>>>>>>> inspection.inspect( content, change );
>>>>>>> int spamScore = inspection.getScore( Score.SPAM );
>>>>>>> context.setVariable( ATTR_SPAMFILTER_SCORE, spamScore );
>>>>>>>
>>>>>>> // Redirect user if score too high
>>>>>>> if( inspection.isFailed() )
>>>>>>> {
>>>>>>>  ...
>>>>>>> }
>>>>>>> ...
>>>>>>> }
>>>>>>
>>>>>> Yes, this looks unnecessarily complicated and limiting to me.  I would
>>>>>> remove automatic limit-rating altogether, and would just concentrate on
>>>>>> refactoring the individual methods from SpamFilter into the
>>>>>> inspect-package
>>>>>> with a light wrapper around them.  I think a callback design is better
>>>>>> than
>>>>>> adding quite limited limit thingies anyway, as this gives full
>>>>>> programmatic
>>>>>> control to the developer without us having to pre-guess the possible
>>>>>> limitations.
>>>>>>
>>>>>> E.g. inspection.inspect( content, change, new
>>>>>> InspectionResultListener()
>>>>>> {
>>>>>>  public boolean visit(Inspection ins)
>>>>>>  {
>>>>>>      // Any possible decision code goes here
>>>>>>      return ins.getScore() > m_maxScore;
>>>>>>  }
>>>>>> });
>>>>>>
>>>>>> Obviously, these listeners would be optional, and you could just let
>>>>>> the
>>>>>> entire chain run without interference. And for simplicity, we can
>>>>>> provide
>>>>>> some default listeners, e.g.
>>>>>>
>>>>>> inspection.inspect( content, change, new StopAtScoreListener(
>>>>>> Score.SPAM,
>>>>>> 5 ) );
>>>>>>
>>>>>> (Yeah, I'm a big fan of this pattern 'cos it gives much more control to
>>>>>> the developer and eases debugging immensely.)
>>>>>>
>>>>>>> ChangeRateInspector, LinkCountInspector, SpamInspector (for Akismet),
>>>>>>
>>>>>> This should obviously be called AkismetInspector; all the others detect
>>>>>> spammers too (and Akismet is probably the worst of the lot anyway).
>>>>>>
>>>>>> /Janne
>>>>
>>>>
>>
>>
>

Re: Spam package redesign

Reply via email to