[GNC-dev] New bayesian account matching algorithm

Christian Gruber Tue, 15 Dec 2020 15:01:23 -0800

Hi devs,

I'd like to propose a new algorithm for matching imported transactions to accounts. I've been working on the bayesian account matcher for several months now in my spare time. The starting point of my efforts was the observation that matching sometimes fails where I wouldn't expect it to. For instance, some recurring transactions that had matched successfully for a long time suddenly started to fail. Additionally, several users on the German mailing list reported similar problems and asked for help on how to "improve" the matching results. The most commonly recommended workaround was to "clean up" the import mapping table using the corresponding import map editor.

Therefore I started trying to understand the current implementation. See my mails in this thread:

   * http://gnucash.1415818.n4.nabble.com/GNC-dev-Understanding-the-bayesian-import-matching-algorithm-td4719812.html



I have already been able to fix some bugs. See this issue on Bugzilla:


   * https://bugs.gnucash.org/show_bug.cgi?id=797587


Another issue is still open:


   * https://bugs.gnucash.org/show_bug.cgi?id=797744

Unfortunately I was not able to fully understand the implementation; see my mails from July in the thread mentioned above. Moreover, it seemed that nobody else could explain the current implementation anymore.

Therefore I stopped trying to understand the algorithm further and prepared a new and improved approach. I did a little research and tried different approaches. The final approach, which I want to propose now, is a probabilistic (bayesian) approach as well and should solve the following two problems of the current algorithm:



   * it should be possible to understand and explain the matching results

   * the account matching probabilities should be "real" probabilities and not only evaluate to 1 or 0 (see my last mail from October in the thread mentioned above)

This new algorithm is even simpler, but still as effective as the current algorithm. And the current algorithm can easily be replaced, because the new one works with the existing import mapping table (token frequency table). It is provided as PR #839 on GitHub:



   * https://github.com/Gnucash/gnucash/pull/839
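To illustrate how a token frequency table can yield "real" probabilities, here is a minimal sketch of a textbook naive Bayes calculation with Laplace smoothing. It is not the code from PR #839 and may differ from it in the details. The function name `account_probabilities`, the `smoothing` parameter, and the nested-dict layout of the table are assumptions made for this example only.

```python
from collections import defaultdict
from math import exp, log


def account_probabilities(tokens, token_counts, smoothing=1.0):
    """Return {account: probability} for the tokens of one transaction.

    token_counts maps token -> {account: count}. This layout is a
    hypothetical stand-in for the import mapping table.
    """
    # Sum the counts per account; used for the prior and the likelihoods.
    account_totals = defaultdict(float)
    for per_account in token_counts.values():
        for account, count in per_account.items():
            account_totals[account] += count
    if not account_totals:
        return {}

    grand_total = sum(account_totals.values())
    vocabulary = len(token_counts)

    log_scores = {}
    for account, total in account_totals.items():
        score = log(total / grand_total)  # prior P(account)
        for token in tokens:
            count = token_counts.get(token, {}).get(account, 0)
            # Laplace-smoothed P(token | account); never exactly 0.
            score += log((count + smoothing) / (total + smoothing * vocabulary))
        log_scores[account] = score

    # Normalise in log space so the probabilities sum to 1.
    peak = max(log_scores.values())
    weights = {a: exp(s - peak) for a, s in log_scores.items()}
    norm = sum(weights.values())
    return {a: w / norm for a, w in weights.items()}
```

Because of the smoothing term, an account that has never seen a token still gets a small non-zero probability, so the results lie strictly between 0 and 1 whenever there is more than one account.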

I tested the algorithm with the transactions from my personal account. I wrote a simple test application which simulates the import of all my transactions one by one in chronological order. For each transaction, this application compares the matched account calculated by the algorithm with the account I manually assigned the transaction to. I used this application to compare the newly proposed algorithm with the current algorithm.
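The replay test described above can be sketched roughly as follows. This is not the actual test application; `replay_accuracy` and the toy `LastSeenMatcher` are hypothetical names used only to show the loop: predict first, compare with the manually assigned account, then let the matcher learn from the transaction.

```python
def replay_accuracy(transactions, predict, learn):
    """Replay transactions in order and return the ratio of correct matches.

    transactions: iterable of (tokens, assigned_account), oldest first.
    predict(tokens) returns an account or None.
    learn(tokens, account) updates the matcher's state.
    """
    correct = 0
    total = 0
    for tokens, assigned_account in transactions:
        if predict(tokens) == assigned_account:
            correct += 1
        total += 1
        # Learn only after scoring, as a real import would.
        learn(tokens, assigned_account)
    return correct / total if total else 0.0


class LastSeenMatcher:
    """Toy matcher (illustration only): remembers the last account per token."""

    def __init__(self):
        self.last_account = {}

    def predict(self, tokens):
        for token in tokens:
            if token in self.last_account:
                return self.last_account[token]
        return None

    def learn(self, tokens, account):
        for token in tokens:
            self.last_account[token] = account
```

Passing the `predict` and `learn` methods of two different matchers to the same function allows a like-for-like comparison on the same transaction history.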

First of all, all transactions that matched correctly before still match correctly using the new algorithm. Additionally, several transactions that did not match correctly before match correctly now. The ratio of correct matches improved from approx. 60 % to 70 %.

But the most important improvement is that false matching results can now be explained.

And this is only a first draft of the new approach. There is still room for further improvement. The algorithm has a new feature: you can select or exclude individual tokens from being considered in the calculation of the account matching probability. I already proposed this idea on Bugzilla:



   * https://bugs.gnucash.org/show_bug.cgi?id=797779

From my point of view this is the key to further improvement. Both the current and the newly proposed algorithm can solve all the unique cases, i.e. transactions which have at least one token that is unique to exactly one account (see also my last mail from October in the thread mentioned above). The more complicated cases are transactions which cannot be matched uniquely to one account, since the user has assigned similar transactions to different accounts in the past.

I played around with these transactions and could already achieve promising results by reducing the 'token_account_probability_threshold' used for automatic token selection. But this raises new problems, since some tokens can pull the matching result away from the correct account. Furthermore, it was no longer easy to find a meaningful threshold for detecting "new" transactions, which haven't occurred before and for which no match should be attempted.

Therefore I decided on a very conservative threshold in the first draft and left this open for future improvements. For now I preferred a robust solution.
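As a rough illustration of threshold-based token selection, the sketch below keeps a token only if its most frequent account holds at least a given share of that token's occurrences. This reading of 'token_account_probability_threshold' is an assumption, and the code is not taken from the PR. The function name `select_tokens`, the `excluded` parameter, and the table layout are likewise hypothetical.

```python
def select_tokens(tokens, token_counts,
                  token_account_probability_threshold,
                  excluded=frozenset()):
    """Return the tokens that are allowed to take part in the matching.

    token_counts maps token -> {account: count} (hypothetical layout).
    excluded holds tokens the user has switched off manually.
    """
    selected = []
    for token in tokens:
        if token in excluded:
            continue
        per_account = token_counts.get(token)
        if not per_account:
            continue  # token has never been seen before
        total = sum(per_account.values())
        if total <= 0:
            continue
        # Share of this token's occurrences held by its dominant account.
        best_share = max(per_account.values()) / total
        if best_share >= token_account_probability_threshold:
            selected.append(token)
    return selected
```

With a threshold of 1.0, only tokens unique to exactly one account survive, which corresponds to the conservative end of the scale. Lower values admit ambiguous tokens as well. Under this sketch, an empty result could be treated as a hint that the transaction is "new".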

OK, now you can have a look and try out the new approach on your own. I'm curious about your feedback.



Christian



