Ultimately, based on our users, I would think a level of automatic index
optimisation is justified, even if it does cause a slowdown in certain
cases where more optimal strategies exist.
Of course, we also need to provide a way to turn it off and do manual
optimisation (or perhaps hints). Once again, this is exactly what goes on in
database query systems!
I believe JRules 6 has ways to turn various options on and off, mostly
hidden away, with sensible defaults.
On 5/5/06, Edson Tirelli <[EMAIL PROTECTED]> wrote:
All,
Here follow some thoughts about the current and future solutions for beta
node memory indexing. Suggestions and opinions are welcome. Sorry for
the long e-mail.
I think a good way to discuss this is by analogy to database
systems. As we all know, indexing is a great mechanism for improving the
performance of database queries, but it also adds a general overhead
to other operations like inserts, updates, and deletes, plus a
memory consumption cost. A well-planned set of indexes is
essential for most enterprise applications, and those responsible for
defining them are usually the DBAs. Once indexes are defined, when a
query is executed against the database, a query planner component
estimates the plan that will run the query with
the best performance, sometimes using an index, sometimes not.
Well, working memory has the same issues, and the same reasoning applies
here. Drools 3 implemented an automatic indexing strategy to index beta
node memories in order to boost performance. Just to have some data to
understand its consequences, let's use the Manners 64 benchmark test
results on a Pentium IV 3 GHz HT machine with 1.0 GB of memory. This is not
really a detailed benchmark, just some rough numbers so that
everyone understands what happened, what is happening, and can possibly
suggest some paths for the future. (times in milliseconds)
Manners 64 without indexes: 135000 ms to run
Manners 64 with beta node indexes: 10500 ms to run
So I think it is obvious that the overall benefit of indexes pays off
the overhead of maintaining them, at least in terms of performance. I'm not
discussing memory-constrained environments here.
During the last few days I started to play with indexing again and
found some odd results. To understand them, let me first explain that a
beta node memory might be indexed by one or more indexes, depending
on how many BoundVariableConstraints that beta node has. For
instance:
MyClass1( $myVal1 : attr1, $myVal2 : attr2 )
MyClass2( attr1 == $myVal1, attr2 != $myVal2, attr3 == "bla", attr4 == 5 )
The above rule would have a join node with two indexable constraints:
attr1 == $myVal1 and attr2 != $myVal2.
The engine is capable of indexing neither of them,
one of them, or both of them. The way it works today, it will always index
by both of them.
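To make the mechanism concrete, here is a rough sketch of what a hash-indexed right beta memory buys you. This is a hypothetical illustration, not Drools' actual internals (class and method names are mine): the equality constraint (attr1 == $myVal1) drives the hash lookup, and the inequality constraint (attr2 != $myVal2) is checked by filtering inside the looked-up bucket, so a probe touches one bucket instead of scanning every fact in the memory.

```java
import java.util.*;

public class BetaMemorySketch {
    // Build the index: attr1 value -> facts, where a fact is {attr1, attr2}.
    static Map<String, List<String[]>> index(List<String[]> facts) {
        Map<String, List<String[]>> idx = new HashMap<>();
        for (String[] f : facts)
            idx.computeIfAbsent(f[0], k -> new ArrayList<>()).add(f);
        return idx;
    }

    // Probe: one hash lookup for attr1 == val1, then filter attr2 != val2.
    static List<String[]> matches(Map<String, List<String[]>> idx,
                                  String val1, String val2) {
        List<String[]> out = new ArrayList<>();
        for (String[] f : idx.getOrDefault(val1, List.of()))
            if (!f[1].equals(val2)) out.add(f);
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String[]>> idx = index(List.of(
            new String[] { "a", "x" },
            new String[] { "a", "y" },
            new String[] { "b", "x" }));
        // Only the ["a","y"] fact satisfies attr1 == "a" && attr2 != "x".
        System.out.println(matches(idx, "a", "x").size()); // prints 1
    }
}
```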
Just to investigate the effects of indexing, I made a
code change to turn off multi-indexing for right beta memories. So, if a
beta node had more than one constraint, I would elect one (arbitrarily)
and index only that column. The result:
Manners 64 with indexing but without multi-index on RIGHT memories: 12000
ms to run
So multi-indexing on right memories is good for Manners. I then
turned multi-indexing back on for right memories and turned it off for
left memories:
Manners 64 with indexing but without multi-index on LEFT memories: 9500
ms to run
So, basically, keeping multi-indexes for left memories in Manners does
not pay off the overhead.
As you can see by now, this is TOTALLY dependent on the rules AND the
data asserted into memory. For a different ruleset, the results could be
totally different.
Besides that, some days ago a Drools user reported an odd problem
that some may remember. He wrote a rule with the following column:
1. alarmdefine(alarmreason==reason, alarmlevel!=level)
The rule was running very slowly for an average dataset (about 20
seconds). I suggested inverting the constraints to:
2. alarmdefine(alarmlevel!=level, alarmreason==reason)
and the rule ran very fast (about 0.6 seconds). The reason for
this is that Drools has a fixed index ordering, from last to first.
So, in the first case above, where two indexes are used
(alarmreason and alarmlevel), Drools indexes the join node memory by
[alarmlevel + alarmreason]. Nothing wrong so far, except that all the data
asserted into memory had the same alarmlevel, so that index was
adding overhead while being completely useless, as it was indexing the
same value in all objects. (For more details on this subject, please
look into the list archive.)
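The effect is easy to see in terms of bucket sizes. In this hypothetical illustration (not Drools code, and the 100-fact data set is invented), every fact shares alarmlevel == 2, so an index on alarmlevel puts all 100 facts in a single bucket and prunes nothing, while an index on alarmreason spreads them across 100 single-fact buckets:

```java
import java.util.*;

public class SelectivitySketch {
    // Count facts per distinct value of the given field.
    static Map<Integer, Integer> bucketSizes(List<int[]> facts, int field) {
        Map<Integer, Integer> sizes = new HashMap<>();
        for (int[] f : facts) sizes.merge(f[field], 1, Integer::sum);
        return sizes;
    }

    public static void main(String[] args) {
        List<int[]> facts = new ArrayList<>();
        for (int reason = 0; reason < 100; reason++)
            facts.add(new int[] { reason, 2 }); // { alarmreason, alarmlevel }

        System.out.println(bucketSizes(facts, 0).size()); // prints 100
        System.out.println(bucketSizes(facts, 1).size()); // prints 1
    }
}
```

A probe through the alarmlevel index still has to scan 100 facts, paying the index maintenance cost for nothing.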
Thinking about that, I devised a strategy for dynamic index
prioritization. The idea is that at run time, when a new object or tuple
reaches a beta node and looks for matches, the engine dynamically
elects and orders the indexes for that specific match. So in the case above, it
would usually elect alarmreason as the primary index, but if someone
asserts a fact where alarmlevel becomes the more restrictive index, then
for that instance only, the primary index would be alarmlevel.
The result is that the rule runs in about 0.8 seconds in all cases,
no matter how it is written.
Manners 64 with index prioritization: 12000 ms
As you can see, this dynamic index prioritization has an overhead
cost, but it allows users not to worry about how they write their rules, as
long as they can accept a performance slowdown.
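The per-match election described above can be sketched roughly as follows. This is a hypothetical illustration (class and method names are mine, not Drools'), simplified to two equality constraints: each probe looks up its value in both candidate indexes and scans only the smaller bucket, verifying the remaining constraint by filtering. Because the election happens per probe, a data set where one attribute is constant automatically falls back to the other index.

```java
import java.util.*;

public class DynamicPrioritization {
    static List<int[]> probe(Map<Integer, List<int[]>> byA,
                             Map<Integer, List<int[]>> byB,
                             int a, int b) {
        List<int[]> bucketA = byA.getOrDefault(a, List.of());
        List<int[]> bucketB = byB.getOrDefault(b, List.of());
        // Elect the more restrictive index for this particular probe.
        List<int[]> scan = bucketA.size() <= bucketB.size() ? bucketA : bucketB;
        List<int[]> out = new ArrayList<>();
        for (int[] f : scan)
            if (f[0] == a && f[1] == b) out.add(f); // full constraint check
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, List<int[]>> byA = new HashMap<>(), byB = new HashMap<>();
        for (int i = 0; i < 100; i++) {
            int[] f = { i, 2 };  // attrB is constant: its index is useless
            byA.computeIfAbsent(f[0], k -> new ArrayList<>()).add(f);
            byB.computeIfAbsent(f[1], k -> new ArrayList<>()).add(f);
        }
        // The probe elects byA (bucket size 1) over byB (bucket size 100).
        System.out.println(probe(byA, byB, 7, 2).size()); // prints 1
    }
}
```

The overhead the benchmark shows comes from the extra bucket-size comparisons performed on every probe.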
Considering all that, I see only 2 ways for Drools to follow:
1. The first way is to allow users to fine-tune their rules for best
performance. This can be achieved by making indexes as fast as possible and
allowing the user to enable/disable/prioritize individual indexes in
their rules. This is easier to implement, as only some syntax sugar
needs to be supported by the parser, and the index factory needs access
to the user definitions.
2. The second way would be to develop an optimizer smart enough to
decide when to use and when not to use indexes, as well as which index to
use, in a similar way to what some databases do. The index prioritization
strategy I developed might be a rough start, but it is miles away from
being the solution.
Well, this is how things stand in this area of development. My
suggestion, based on the results, is not to commit the index prioritization
stuff. I will just keep a backup of it for future reference, but right
now I think the overhead does not pay off. Better to educate users on
how to write better rules for now.
If anyone has or knows of good material in this area of knowledge,
I would really appreciate links or suggestions.
Let me know what you think.
Regards,
Edson