Hi devs,

I'd like to propose a new algorithm for matching imported transactions to accounts. I've been dealing with the bayesian account matcher for several months now during my spare free time. Starting point of my efforts was the observation, that sometimes matching fails, where I wouldn't expect that. For instance some recurrent transactions, which could be matched successfully for a long time, suddenly started to fail matching. Additionally several users in the german mailing list reported similar problems and asked for help on how to "improve" the matching results. The mostly recommended workaround was to "clean up" the import mapping table using the corresponding import map editor.


Therefore I started trying to understand the current implementation. See my mails from that thread:


   * http://gnucash.1415818.n4.nabble.com/GNC-dev-Understanding-the-bayesian-import-matching-algorithm-td4719812.html


I could fix some bugs already. See this issue on Bugzilla:


   * https://bugs.gnucash.org/show_bug.cgi?id=797587


Another issue is still open:


   * https://bugs.gnucash.org/show_bug.cgi?id=797744


Unfortunatelly I was not able to fully understand the implementation, see my mails from july in the mail thread mentioned above. Moreover it seemed, that nobody else could explain the current implementation anymore.


Therefore I stopped trying to further understand the algorithm and prepared a new and improved approach. I did a little research and tried different approaches. The final approach, I want to propose now, is a probalistic (bayesian) approach as well and should solve the following two problems of the current algorithm:


   * it should be possible to understand and explain the matching results

   * the account matching probabilities should be "real" probabilities and not only result to 1 or 0 (see my last mail from october in the mail thread mentioned above)


This new algorithm is even simpler, but still as effective as the current algorithm. And the current algorithm can be easily replaced using the existing import mapping table (token frequency table). It is provided as PR #839 on Github:


   * https://github.com/Gnucash/gnucash/pull/839


I tested the algorithm with the transactions from my personal account. I wrote a simple test application, which simulates the import of all my transactions one by one in chronological order. For each transaction this application compares the matched account calculated by the algorithm with the account I manually assigned the transaction to. I used this application to compare the new proposed algorithm with the current algorithm.


At first all transactions, which did match correctly before still match correctly using the new algorithm. Additionally several transactions, which did not match correcly before, match correctly now. The ratio of correct matches could be improved from approx. 60 % to 70 %.


But the most important improvement is, that false matching results can be explained now.


And this is a first draft of the new approach only. There is still space for further improvement. The algorithm has a new feature. You can select or exclude individual tokens from being considered for calculation of account matching probability. I already proposed this idea on Bugzilla


   * https://bugs.gnucash.org/show_bug.cgi?id=797779


From my point of view this is the key for further improvement. The current as well as the new proposed algorithm can solve all the unique cases, i.e. transactions with have at least one token, which is unique to exactly one account (see also my last mail from october in the mail thread mentioned above). The more complicated cases are transactions, which can not be matched uniquely to one account, since similar transactions have been assigned to different accounts in the past by the user.


I played around with these transactions and could already achieve promising results by reducing the 'token_account_probability_threshold' used for automatic token selection. But this raises new problems, since there are some tokens, which can distract the matching result from the correct account. Furthermore it wasn't that easy anymore to find a meaningful threshold to detect "new" transactions, which haven't occured before and shouldn't be tried to match.


Therefore I decided for a very conservative threshold in the first draft and left this open for future improvements. I prefered a robust solution at the moment.


Ok, now you can have a look and try out the new approach on your own. And I'm curious about your feedback.


Christian



_______________________________________________
gnucash-devel mailing list
gnucash-devel@gnucash.org
https://lists.gnucash.org/mailman/listinfo/gnucash-devel

Reply via email to