Hi,
I've just uploaded a Decision Tree module to CPAN. It provides
a pretty bare-bones implementation of the classic ID3 decision
tree building algorithm.
I plan on speaking about this module as part of my presentation
at TPC in July, so I might add a few more features to it if I
decide it needs them in order to be shown in public. See the
"LIMITATIONS" section of the docs.
Since it's a new module, I thought I'd include the docs here.
-Ken
NAME
AI::DecisionTree - Automatically Learns Decision Trees
SYNOPSIS
    use AI::DecisionTree;
    my $dtree = new AI::DecisionTree;

    # A set of training data for deciding whether to play tennis
    $dtree->add_instance
      (attributes => {outlook     => 'sunny',
                      temperature => 'hot',
                      humidity    => 'high'},
       result => 'no');

    $dtree->add_instance
      (attributes => {outlook     => 'overcast',
                      temperature => 'hot',
                      humidity    => 'normal'},
       result => 'yes');

    ... repeat for several more instances, then:
    $dtree->train;

    # Find results for unseen instances
    my $result = $dtree->get_result
      (attributes => {outlook     => 'sunny',
                      temperature => 'hot',
                      humidity    => 'normal'});
DESCRIPTION
The "AI::DecisionTree" module automatically creates
so-called "decision
trees" to explain a set of training data. A decision tree is
a kind of
categorizer that use a flowchart-like process for categorizing new
instances. For instance, a learned decision tree might look like the
following, which classifies for the concept "play tennis":
                   OUTLOOK
                   /  |  \
                  /   |   \
                 /    |    \
           sunny/ overcast  \rainy
               /      |      \
         HUMIDITY     |     WIND
           /  \     *yes*   /  \
          /    \           /    \
     high/      \normal   /      \
        /        \ strong/        \weak
     *no*       *yes*   /          \
                     *no*         *yes*
(This example, and the inspiration for the "AI::DecisionTree" module,
come directly from Tom Mitchell's excellent book "Machine Learning",
available from McGraw Hill.)
A decision tree like this one can be learned from training data, and
then applied to previously unseen data to obtain results that are
consistent with the training data.
The usual goal of a decision tree is to somehow encapsulate the
training data in the smallest possible tree. This is motivated by an
"Occam's Razor" philosophy, in which the simplest possible explanation
for a set of phenomena should be preferred over other explanations.
Also, small trees will make decisions faster than large trees, and
they are much easier for a human to look at and understand. One of the
biggest reasons for using a decision tree instead of many other
machine learning techniques is that a decision tree is a much more
scrutable decision maker than, say, a neural network.

The current implementation of this module uses an extremely simple
method for creating the decision tree based on the training instances.
It uses an Information Gain metric (based on expected reduction in
entropy) to select the "most informative" attribute at each node in
the tree. This is essentially the ID3 algorithm, developed by J. R.
Quinlan in 1986. The idea is that the attribute with the highest
Information Gain will (probably) be the best attribute to split the
tree on at each point if we're interested in making small trees.
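For concreteness, the Information Gain calculation works roughly like
the sketch below. This is only an illustration of the metric, not the
module's actual internals, and the instance data structure shown in
the comments is made up for the example:

    use strict;
    use warnings;

    # Entropy of a list of result labels, in bits.
    sub entropy {
        my @results = @_;
        my %count;
        $count{$_}++ for @results;
        my $total = @results;
        my $h = 0;
        for my $n (values %count) {
            my $p = $n / $total;
            $h -= $p * log($p) / log(2);
        }
        return $h;
    }

    # Expected reduction in entropy from splitting a set of instances
    # on a single attribute.  Each instance is a hash like
    # { attributes => { outlook => 'sunny', ... }, result => 'no' }.
    sub information_gain {
        my ($attr, @instances) = @_;
        my %partition;
        push @{ $partition{ $_->{attributes}{$attr} } }, $_->{result}
            for @instances;
        my $total = @instances;
        my $remainder = 0;
        for my $subset (values %partition) {
            $remainder += (@$subset / $total) * entropy(@$subset);
        }
        return entropy(map { $_->{result} } @instances) - $remainder;
    }

At each node, ID3 simply picks the attribute whose information gain is
largest and then recurses on each resulting partition of the data.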
METHODS
new()
Creates a new decision tree object.
add_instance()
Adds a training instance to the set of instances which will be used to
form the tree. An "attributes" parameter specifies a hash of
attribute-value pairs for the instance, and a "result" parameter
specifies the result.
train()
Builds the decision tree from the list of training instances.
get_result()
Returns the most likely result (from the set of all results given to
"add_instance()") for the set of attribute values given. An
"attributes" parameter specifies a hash of attribute-value pairs for
the instance. If the decision tree doesn't have enough information to
find a result, it will return "undef".
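For example (the attribute values here are just made up for
illustration), you'd probably want to check for an undefined result:

    my $result = $dtree->get_result
      (attributes => {outlook     => 'rain',
                      temperature => 'mild',
                      humidity    => 'high'});

    if (defined $result) {
        print "Predicted: $result\n";
    } else {
        warn "Not enough information to classify this instance\n";
    }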
rule_statements()
Returns a list of strings that describe the tree in rule-form. For
instance, for the tree diagram above, the following list would be
returned (though not necessarily in this order - the order is
unpredictable):

    if outlook='rain' and wind='strong' -> 'no'
    if outlook='rain' and wind='weak' -> 'yes'
    if outlook='sunny' and humidity='normal' -> 'yes'
    if outlook='sunny' and humidity='high' -> 'no'
    if outlook='overcast' -> 'yes'
This can be helpful for scrutinizing the structure of a tree.
Note that while the order of the rules is unpredictable, the order of
criteria within each rule reflects the order in which the criteria
will be checked at decision-making time.
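For example, to print out the rules of a trained tree:

    foreach my $rule ($dtree->rule_statements) {
        print "$rule\n";
    }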
LIMITATIONS
A few limitations exist in the current version. All of them could be
removed in future versions - especially with your help. =)
No continuous attributes
In the current implementation, only discrete-valued attributes are
supported. This means that an attribute like "temperature" can have
values like "cool", "medium", and "hot", but using actual temperatures
like 87 or 62.3 is not going to work. This is because the values would
split the data too finely - the tree-building process would probably
think that it could make all its decisions based on the exact
temperature value alone, ignoring all other attributes, because each
temperature would have only been seen once.

The usual way to deal with this problem is for the tree-building
process to figure out how to place the continuous attribute values
into a set of bins (like "cool", "medium", and "hot") and then build
the tree based on these bin values. Future versions of
"AI::DecisionTree" may provide support for this. For now, you have to
do it yourself.
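If you want to roll your own binning for now, it might look something
like this (the bin_temperature() helper and its cutoff values are just
an example, not part of the module):

    # Map a raw numeric temperature onto one of the discrete values
    # the tree can handle.  The cutoffs here are arbitrary.
    sub bin_temperature {
        my $degrees = shift;
        return $degrees < 65 ? 'cool'
             : $degrees < 80 ? 'medium'
             :                 'hot';
    }

    $dtree->add_instance
      (attributes => {outlook     => 'sunny',
                      temperature => bin_temperature(87),   # 'hot'
                      humidity    => 'high'},
       result => 'no');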
No support for noisy or inconsistent data
Currently, the tree-building technique cannot handle inconsistent
data. That is, if you have two contradictory training instances, the
"train()" method will throw an exception. Future versions may allow
inconsistent data, finding the decision tree that most closely matches
the data rather than precisely matching it.
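In the meantime, if you'd rather detect the problem than die, you can
trap the exception yourself with Perl's eval:

    eval { $dtree->train };
    if ($@) {
        warn "Couldn't build a consistent tree: $@";
    }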
No support for tree-trimming
Most decision tree building algorithms use a two-stage building
process - first a tree is built that completely fits the training data
(or fits it as closely as possible if noisy data is supported), and
then the tree is pruned so that it will generalize well to new
instances. This might be done either by maximizing performance on a
set of held-out training instances, or by pruning parts of the tree
that don't seem like they'll be very valuable.

Currently, we build a tree that completely fits the training data, and
we don't prune it. That means that the tree may overfit the training
data in many cases - i.e., you won't be able to see the forest for the
trees (or, more accurately, the tree for the leaves).

This is mainly a problem when you're using "real world" or noisy data.
If you're using data that you know to be a result of a rule-based
process and you just want to figure out what the rules are, the
current implementation should work fine for you.
AUTHOR
Ken Williams, [EMAIL PROTECTED]
SEE ALSO
Mitchell, Tom (1997). Machine Learning. McGraw-Hill. pp 52-80.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning,
1(1), pp 81-106.
the perl manpage.