Re: Spam package redesign

Janne Jalkanen Mon, 28 Sep 2009 00:11:53 -0700

Yah, it's just that smaller numbers are usually easier to deal with(numbers from 1-20). We need more resolution at the low end, not atthe high end. That's my preference for using floats.

A simple way to do as you suggest would be to have a Score returnvalue, e.g.


public Score inspect(...)
{
    // Detection score goes here

    if( isSpam )
        return new Score( "mywonderfulspamdetectortest", 0.5f );

    return Score.OK; // where score.OK is defined to be 0.0f.
}

where the first String is a key and the float is a default value,which can be overridden in a config file using the key. This would bepretty close to the SpamAssassin model (and close to thejava.util.Properties conceptually, so it would be easy to adapt).

(The reason why I'm proposing a default value is so that when you adda new Inspector, you don't have to modify two files.)

After each Inspector would be called, then the parent would call thecallback I proposed, and that would then determine what to do. Inpseudocode:


// JSPWiki internal code.

private void inspect( DecisionMaker decisionMaker, Change change,<some other values> )

{
    ArrayList scores;
    for( Inspector inspector : m_inspectorChain )
    {
        Score s = inspector.inspect( change, <some values> );

        scores.add( s );

        if( decisionMaker.visit( scores ) == Decision.SPAM ) { ... }
    }
}

We can then provide something like this

class SpamFilter implements PageFilter, DecisionMaker
{
    ...

    // This emulates the 2.8.x SpamFilter functionality.
    public Decision visit( List<Score> scores )
    {
        if( scores.last().getValue() > 0.0 ) return Decision.SPAM;

        return Decision.CONTINUE;
    }
}

(At any rate, it might make sense to reuse some classes from theworkflow package here; after all, this is a machine workflow in asense, is it not?)

(Another possible pattern would be to have the DecisionMaker throw anexception. Don't quite know what's better.)


/Janne

On Sep 28, 2009, at 02:04 , Andrew Jaquith wrote:

It's only more complicated in the sense that I'd have to change some
recently-written code. :)

Both integer and floating point -- as you point out -- are really the
same except for the choice of where to put the decimal point. I chose
integer to preserve the same scoring strategy from the previous
design. Basically: add stuff up, and if the total exceeds a theshold,
you've got spam.

I took a quick look at SpamAssassin. They do use floating point
scores. I like the way they allow weights to be customized for each
rule. That seems like it would be a good enhancement for us, too.

So, it seems to me that if we go floating-point, we should make it
possible to configure custom weights for each Inspector. That also
implies that it would be better to simply have Inspector.inspect()
simply return a "pass/fail/no result" result and have the parent
Inspection object figure out what to add or subtract from the running
total. This would mean that the Inspectors wouldn't be able to modify
scores directly. And it would be closer to the SpamAssessin model, and
would probably simpler to code for too.

Reasonable?

On Sun, Sep 27, 2009 at 4:12 PM, Janne Jalkanen
<[email protected]> wrote:
Heya!
I don't quite understand - why would a floating point value be anymore
complicated than an integer?
Note that at the moment we mark a modification a spam if a singletestfails, which essentially means doing logical OR more than any sortof anaddition (=threshold is always 1). The integer is there more orless assomething-which-was-never-completed-and-I-am-almost-sure-is-the-wrong-design.In fact, I did lament the fact that you can't say that "X is a bitmoreefficient than Y" - with an integer-based design you essentially goforfixed point arithmetics (test A gives you 100, test B gives you150, etc,where you just shifted decimal points to the right just for theheck of it).
If you take a look at SpamAssassin, it uses floats.

I'd hesitate changing it later, since it will create an API contract.

/Janne

On Sep 27, 2009, at 22:34 , Andrew Jaquith wrote:
Janne --- good critique, and good suggestions. I've killed the
limit-setting methods on the Inspection class itself in favor of alistener
strategy. I adopteed your other suggestions too, except...
...at the moment I am inclined to leave the scoring scale asinteger-basedfor now. I like it because it is simple, and resembles what we donow. We
can change it later if need be.
In answer to your first question: the inspector package does nothelpgraduation, although it does help the beta that comes after thediploma. :)
Andrew
On Sep 27, 2009, at 4:37, Janne Jalkanen<[email protected]> wrote:
My main concern: does this advance our graduation?

Others inline.

On 25 Sep 2009, at 20:54, Andrew Jaquith wrote:
The "inspect" package provides facilities that allow content to be
inspected and scored based on various criteria such as whether awikipage change contains spam. Content may be scored by initializinganInspection object and then calling the {...@linkInspection#inspect()}
method. The {...@code inspect} method, in turn, iterates through a
variety of previously-configured {...@link Inspector} objects andcallsthe {...@link Inspector#inspect(String,String)} method for eachone. Eachof the configured inspectors is free to perform whateverevaluationsit wishes, and can increase, decrease or reset the "scores" forany
score category, for example the {...@link Score#SPAM} category.
Ok.
* Score objects supply instructions to the parent Inspectionobject to
increment, decrement or reset the score for a particular category.
Each Score object is constructed with a category (for example,
Score.SPAM), an integer indicating how much to change the score,andan optional String message that provides context for the change.For
example, a Score that increments the spam score by 1 could be
constructed by new Score( Score.SPAM, 1, "Bot detected." ).Negativenumbers can be supplied also to decrease the score. Forconvenience,
{...@link Score#INCREMENT_SCORE} means "add 1", {...@link
Score#DECREMENT_SCORE} means "subtract 1", and {...@link Score#RESET}
means "reset to zero."
Smells of overdesign to me. Creating a new operation andmethodology tosupport addition sounds like a bad idea. Not to mention the factthat you
might actually want to use fractions.
Something like inspection.changeScore( Score.SPAM, 0.5f ) soundsmore
appealing to me.
public String preSave( WikiContext context, String content )throws
RedirectException
{
Change change = getChange( context, content );

// Run the Inspection
Inspection inspection = new Inspection( context, m_config,m_inspectors
);
inspection.addLimit( Score.SPAM, m_scoreLimit );
inspection.inspect( content, change );
int spamScore = inspection.getScore( Score.SPAM );
context.setVariable( ATTR_SPAMFILTER_SCORE, spamScore );

// Redirect user if score too high
if( inspection.isFailed() )
{
 ...
}
...
}
Yes, this looks unnecessarily complicated and limiting to me. Iwouldremove automatic limit-rating altogether, and would justconcentrate onrefactoring the individual methods from SpamFilter into theinspect-packagewith a light wrapper around them. I think a callback design isbetter thanadding quite limited limit thingies anyway, as this gives fullprogrammaticcontrol to the developer without us having to pre-guess thepossible
limitations.
E.g. inspection.inspect( content, change, newInspectionResultListener()
{
 public boolean visit(Inspection ins)
 {
      // Any possible decision code goes here
      return ins.getScore() > m_maxScore;
 }
});
Obviously, these listeners would be optional, and you could justlet theentire chain run without interference. And for simplicity, we canprovide
some default listeners, e.g.
inspection.inspect( content, change, newStopAtScoreListener( Score.SPAM,
5 ) );
(Yeah, I'm a big fan of this pattern 'cos it gives much morecontrol to
the developer and eases debugging immensely.)
ChangeRateInspector, LinkCountInspector, SpamInspector (forAkismet),
This should obviously be called AkismetInspector; all the othersdetect
spammers too (and Akismet is probably the worst of the lot anyway).

/Janne

Re: Spam package redesign

Reply via email to