Hi Derek,

I further investigated the implemented algorithm during the last months using the transactions from my personal account. I wrote a simple application which simulates the import of all my transactions one by one in chronological order. For each transaction this application compares the matched account calculated by the implemented bayes algorithm with the account I manually assigned the transaction to. In case of a mismatch the application provides several debug information like the calculated probabilities for each account and the frequencies of each token for instance to see what's going wrong.

What I found out was very surprising to me. Here are my results:

* the calculated probabilities for each account P(A_k | T_1, ..., T_N) are
  always whether exactly 1 or almost 0, i.e. the matched account is actually a
  binary decision
* sometimes there is more than one account with probability 1 and the finally
  selected matching account (function highest_probability()) can be the
  correct one or not, it's random
* the condition for a probability of 1 is, that there exists a token, which is
  unique for a single account, i.e. all transactions containing this token have
  always been assigned to the same account
* if there are several tokens, which are unique for different accounts, all
  these accounts have probability 1
* if there is no token, which is unique to one account, the probability is
  almost 0 for all accounts

That means, as soon as there is no unique token, which can be used for classification, the implemented algorithm fails. Moreover if the relevant tokens are not unique, but there are unique tokens, which are actually irrelevant, these irrelevant tokens are used for classification. This was the most common observed case for wrong account selection, since the irrelevant tokens where mostly not assigned to the correct account. In many cases these irrelevant tokens had frequencies of 1, i.e. there was only one transaction before with randomly the same token.

In addition this would mean, that the bayes classifier is actually needless, since the same classification results can be achieved with a much simpler rule-based decision algorithm and without any probalistic approach.

These observations encourage my suspicion, that something is wrong with the implemented algorithm.

Christian


Am 06.07.20 um 01:05 schrieb Christian Gruber:

Am 02.07.20 um 21:28 schrieb Derek Atkins:
Hi,

On Thu, July 2, 2020 3:10 pm, Christian Gruber wrote:
Hi,

while further studying the bayesian import matching algorithm I'm now at
the point, where I wanted to understand, how the bayes formula is
applied to the problem of matching transactions to accounts using
tokens. But I need further information, since it doesn't come clear to
me what is really calculated there.

The implementation can be found in the following functions in Account.cpp:

   * get_first_pass_probabilities()
   * build_probabilities()
   * highest_probability()

Actually, the latter could be omitted as it only selects the account
with the highest matching probability.

Studying the code and the rare comments on the implementation it seems
to be a variant of the naive bayes classifier
<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Probabilistic_model>
with the tokens used as (independent) "features" and the accounts used
as "classes". But comparing this algorithm to the code leaves several
questions open.

Does anybody know a more precise algorithm description, on which the
implementation in GnuCash is based on?
I'm not sure how detailed you need right now; I helped with some of the
initial implementations but I'm sure it's all been rewritten by now.  The

No, I don't think so. The basic algorithm seems to be unchanged, since it was implemented the first time by you with commit b2ccbf6 in 2003. Yes, the code was moved between files and files have been renamed. Also the code was restructured in commit b3667c76f "Implement flat bayes kvp" and there were lots of smaller code changes over the time. But the algorithm still seems to be the same.

idea is that the description/memo strings are tokenized and used as inputs
into the probabilities that the transaction would go into the target
account.  If you have a high-enough probability it will auto-select that
account for that transaction.

When you assign an account (during import), it adds those tokens to the
account's list of tokens for future guessing.
The main principle is clear to me.

Did you have a specific question about the process?  For the complete
algorithm you can look at the code.  It's really not all that complicated
(or at least it wasn't when first implemented).

Ok, I try to be more precise.

I. The implemented algorithm I read from the code is the following:

                                product_k
P(A_k | T_1, ..., T_N) = --------------------------
                         product_k + product_diff_k
with

product_k      =      P(A_k | T_1)  * ... *      P(A_k | T_N)
product_diff_k = [1 - P(A_k | T_1)] * ... * [1 - P(A_k | T_N)]

and with

P(A_k | T_n)           ... probability of account A_k under the
                           condition of token T_n
P(A_k | T_1, ..., T_N) ... probability of account A_k under the
                           condition of a list of tokens T_1, ..., T_N
N                      ... number of tokens


II. By applying the formulas of the Bayes classifier (see e.g. Wikipedia) I get the following:

                                 product_k
P(A_k | T_1, ..., T_N) = ---------------------------
                         product_1 + ... + product_K
with

product_k = P(T_1 | A_k) * ... * P(T_N | A_k)

and with

P(T_n | A_k) ... probability of token T_n under the condition
                 of account A_k
K            ... number of accounts
N            ... number of tokens

assuming equal total probability P(A_k) for all accounts.


III. Simplifying these formulas by only distinguishing between account A_k and all other accounts leads to the following formulas:

                                product_k
P(A_k | T_1, ..., T_N) = -------------------------
                         product_k + product_not_k
with

product_k     = P(T_1 |  A_k) * ... * P(T_N |  A_k)
product_not_k = P(T_1 | ¬A_k) * ... * P(T_N | ¬A_k)

and with

P(T_n |  A_k) ... probability of token T_n under the condition
                  of account A_k
P(T_n | ¬A_k) ... probability of Token T_n under the condition
                  of any other account than A_k
K             ... number of accounts
N             ... number of tokens

assuming equal total probability for P(A_k) and P(¬A_k).


As you can see neither the formulas in derivation II nor the formulas in derivation III fit to formulas in derivation I.

One obvious difference is, that the implemented code uses probabilities P(A_k | T_n), whereas the formulas derived from bayes classification use P(T_n | A_k).


Regards,
Christian
-derek
_______________________________________________
gnucash-devel mailing list
gnucash-devel@gnucash.org
https://lists.gnucash.org/mailman/listinfo/gnucash-devel

Reply via email to