Re: [GNC-dev] Understanding the bayesian import matching algorithm

Christian Gruber Sun, 11 Oct 2020 15:40:32 -0700

Hi Derek,

I further investigated the implemented algorithm during the last monthsusing the transactions from my personal account. I wrote a simpleapplication which simulates the import of all my transactions one by onein chronological order. For each transaction this application comparesthe matched account calculated by the implemented bayes algorithm withthe account I manually assigned the transaction to. In case of amismatch the application provides several debug information like thecalculated probabilities for each account and the frequencies of eachtoken for instance to see what's going wrong.


What I found out was very surprising to me. Here are my results:

* the calculated probabilities for each account P(A_k | T_1, ..., T_N) are
  always whether exactly 1 or almost 0, i.e. the matched account is actually a
  binary decision
* sometimes there is more than one account with probability 1 and the finally
  selected matching account (function highest_probability()) can be the
  correct one or not, it's random
* the condition for a probability of 1 is, that there exists a token, which is
  unique for a single account, i.e. all transactions containing this token have
  always been assigned to the same account
* if there are several tokens, which are unique for different accounts, all
  these accounts have probability 1
* if there is no token, which is unique to one account, the probability is
  almost 0 for all accounts

That means, as soon as there is no unique token, which can be used forclassification, the implemented algorithm fails. Moreover if therelevant tokens are not unique, but there are unique tokens, which areactually irrelevant, these irrelevant tokens are used forclassification. This was the most common observed case for wrong accountselection, since the irrelevant tokens where mostly not assigned to thecorrect account. In many cases these irrelevant tokens had frequenciesof 1, i.e. there was only one transaction before with randomly the sametoken.

In addition this would mean, that the bayes classifier is actuallyneedless, since the same classification results can be achieved with amuch simpler rule-based decision algorithm and without any probalisticapproach.

These observations encourage my suspicion, that something is wrong withthe implemented algorithm.


Christian


Am 06.07.20 um 01:05 schrieb Christian Gruber:


Am 02.07.20 um 21:28 schrieb Derek Atkins:

Hi,

On Thu, July 2, 2020 3:10 pm, Christian Gruber wrote:

Hi,

while further studying the bayesian import matching algorithm I'mnow at

the point, where I wanted to understand, how the bayes formula is
applied to the problem of matching transactions to accounts using
tokens. But I need further information, since it doesn't come clear to
me what is really calculated there.

The implementation can be found in the following functions inAccount.cpp:


   * get_first_pass_probabilities()
   * build_probabilities()
   * highest_probability()

Actually, the latter could be omitted as it only selects the account
with the highest matching probability.

Studying the code and the rare comments on the implementation it seems
to be a variant of the naive bayes classifier

<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Probabilistic_model>

with the tokens used as (independent) "features" and the accounts used
as "classes". But comparing this algorithm to the code leaves several
questions open.

Does anybody know a more precise algorithm description, on which the
implementation in GnuCash is based on?

I'm not sure how detailed you need right now; I helped with some of the

initial implementations but I'm sure it's all been rewritten by now. The

No, I don't think so. The basic algorithm seems to be unchanged, sinceit was implemented the first time by you with commit b2ccbf6 in 2003.Yes, the code was moved between files and files have been renamed.Also the code was restructured in commit b3667c76f "Implement flatbayes kvp" and there were lots of smaller code changes over the time.But the algorithm still seems to be the same.

idea is that the description/memo strings are tokenized and used asinputs

into the probabilities that the transaction would go into the target
account.  If you have a high-enough probability it will auto-select that
account for that transaction.

When you assign an account (during import), it adds those tokens to the
account's list of tokens for future guessing.

The main principle is clear to me.

Did you have a specific question about the process?  For the complete
algorithm you can look at the code. It's really not all thatcomplicated
(or at least it wasn't when first implemented).


Ok, I try to be more precise.

I. The implemented algorithm I read from the code is the following:

                                product_k
P(A_k | T_1, ..., T_N) = --------------------------
                         product_k + product_diff_k
with

product_k      =      P(A_k | T_1)  * ... *      P(A_k | T_N)
product_diff_k = [1 - P(A_k | T_1)] * ... * [1 - P(A_k | T_N)]

and with

P(A_k | T_n)           ... probability of account A_k under the
                           condition of token T_n
P(A_k | T_1, ..., T_N) ... probability of account A_k under the
                           condition of a list of tokens T_1, ..., T_N
N                      ... number of tokens

II. By applying the formulas of the Bayes classifier (see e.g.Wikipedia) I get the following:


                                 product_k
P(A_k | T_1, ..., T_N) = ---------------------------
                         product_1 + ... + product_K
with

product_k = P(T_1 | A_k) * ... * P(T_N | A_k)

and with

P(T_n | A_k) ... probability of token T_n under the condition
                 of account A_k
K            ... number of accounts
N            ... number of tokens

assuming equal total probability P(A_k) for all accounts.

III. Simplifying these formulas by only distinguishing between accountA_k and all other accounts leads to the following formulas:


                                product_k
P(A_k | T_1, ..., T_N) = -------------------------
                         product_k + product_not_k
with

product_k     = P(T_1 |  A_k) * ... * P(T_N |  A_k)
product_not_k = P(T_1 | ¬A_k) * ... * P(T_N | ¬A_k)

and with

P(T_n |  A_k) ... probability of token T_n under the condition
                  of account A_k
P(T_n | ¬A_k) ... probability of Token T_n under the condition
                  of any other account than A_k
K             ... number of accounts
N             ... number of tokens

assuming equal total probability for P(A_k) and P(¬A_k).

As you can see neither the formulas in derivation II nor the formulasin derivation III fit to formulas in derivation I.

One obvious difference is, that the implemented code usesprobabilities P(A_k | T_n), whereas the formulas derived from bayesclassification use P(T_n | A_k).

Regards,
Christian

-derek

_______________________________________________
gnucash-devel mailing list
gnucash-devel@gnucash.org
https://lists.gnucash.org/mailman/listinfo/gnucash-devel

Re: [GNC-dev] Understanding the bayesian import matching algorithm

Reply via email to