Re: [GNC-dev] Understanding the bayesian import matching algorithm

2020-10-11 Thread Christian Gruber

Hi Derek,

I further investigated the implemented algorithm during the last months 
using the transactions from my personal account. I wrote a simple 
application which simulates the import of all my transactions one by one 
in chronological order. For each transaction this application compares 
the matched account calculated by the implemented bayes algorithm with 
the account I manually assigned the transaction to. In case of a 
mismatch the application provides several debug information like the 
calculated probabilities for each account and the frequencies of each 
token for instance to see what's going wrong.


What I found out was very surprising to me. Here are my results:

* the calculated probabilities for each account P(A_k | T_1, ..., T_N) are
  always whether exactly 1 or almost 0, i.e. the matched account is actually a
  binary decision
* sometimes there is more than one account with probability 1 and the finally
  selected matching account (function highest_probability()) can be the
  correct one or not, it's random
* the condition for a probability of 1 is, that there exists a token, which is
  unique for a single account, i.e. all transactions containing this token have
  always been assigned to the same account
* if there are several tokens, which are unique for different accounts, all
  these accounts have probability 1
* if there is no token, which is unique to one account, the probability is
  almost 0 for all accounts

That means, as soon as there is no unique token, which can be used for 
classification, the implemented algorithm fails. Moreover if the 
relevant tokens are not unique, but there are unique tokens, which are 
actually irrelevant, these irrelevant tokens are used for 
classification. This was the most common observed case for wrong account 
selection, since the irrelevant tokens where mostly not assigned to the 
correct account. In many cases these irrelevant tokens had frequencies 
of 1, i.e. there was only one transaction before with randomly the same 
token.


In addition this would mean, that the bayes classifier is actually 
needless, since the same classification results can be achieved with a 
much simpler rule-based decision algorithm and without any probalistic 
approach.


These observations encourage my suspicion, that something is wrong with 
the implemented algorithm.


Christian


Am 06.07.20 um 01:05 schrieb Christian Gruber:


Am 02.07.20 um 21:28 schrieb Derek Atkins:

Hi,

On Thu, July 2, 2020 3:10 pm, Christian Gruber wrote:

Hi,

while further studying the bayesian import matching algorithm I'm 
now at

the point, where I wanted to understand, how the bayes formula is
applied to the problem of matching transactions to accounts using
tokens. But I need further information, since it doesn't come clear to
me what is really calculated there.

The implementation can be found in the following functions in 
Account.cpp:


   * get_first_pass_probabilities()
   * build_probabilities()
   * highest_probability()

Actually, the latter could be omitted as it only selects the account
with the highest matching probability.

Studying the code and the rare comments on the implementation it seems
to be a variant of the naive bayes classifier
 


with the tokens used as (independent) "features" and the accounts used
as "classes". But comparing this algorithm to the code leaves several
questions open.

Does anybody know a more precise algorithm description, on which the
implementation in GnuCash is based on?

I'm not sure how detailed you need right now; I helped with some of the
initial implementations but I'm sure it's all been rewritten by now.  
The


No, I don't think so. The basic algorithm seems to be unchanged, since 
it was implemented the first time by you with commit b2ccbf6 in 2003. 
Yes, the code was moved between files and files have been renamed. 
Also the code was restructured in commit b3667c76f "Implement flat 
bayes kvp" and there were lots of smaller code changes over the time. 
But the algorithm still seems to be the same.


idea is that the description/memo strings are tokenized and used as 
inputs

into the probabilities that the transaction would go into the target
account.  If you have a high-enough probability it will auto-select that
account for that transaction.

When you assign an account (during import), it adds those tokens to the
account's list of tokens for future guessing.

The main principle is clear to me.


Did you have a specific question about the process?  For the complete
algorithm you can look at the code.  It's really not all that 
complicated

(or at least it wasn't when first implemented).


Ok, I try to be more precise.

I. The implemented algorithm I read from the code is the following:

    product_k
P(A_k | T_1, ..., T_N) = --
 product_k + product_diff_k
with

product_k   

Re: [GNC-dev] Account > Auto-Clear

2020-10-11 Thread John Ralls



> On Oct 11, 2020, at 6:32 AM, Christopher Lam  
> wrote:
> 
> 1) number of uncleared splits too high -- due to NP-hard problem, I find
> the limit around 20 uncleared splits

That's way too low a value--only 256K permutations--for it to be an NP Hard 
limitation. Far more likely it's down to a poorly selected algorithm; it's hard 
to imagine a worse choice than the provided brute-force combination of GLists 
and a GHashTable. https://www.geeksforgeeks.org/0-1-knapsack-problem-dp-10/ has 
3 better algorithms.

Regards,
John Ralls

___
gnucash-devel mailing list
gnucash-devel@gnucash.org
https://lists.gnucash.org/mailman/listinfo/gnucash-devel


[GNC-dev] Account > Auto-Clear

2020-10-11 Thread Christopher Lam
Dear Devel,

After using GnuCash for years, I've recently discovered the interesting
Account Auto-clear facility. Consider a well-used bank account, with
numerous Reconciled "r" splits, and some recent Unreconciled "n" splits
whose amounts are as follows:

Reconciled Balance = 1000.00
Unreconciled amounts -10.00, 2500.00, -33.30, -45.00

The user checks the online bank statement, finds the current balance
currently $3456.70, corresponding to the first 3 splits being processed by
the bank, and the last transaction pending. Auto-clear will accept 3456.70
and will set relevant splits to "c" cleared. But It cannot handle the
following scenarios:
1) number of uncleared splits too high -- due to NP-hard problem, I find
the limit around 20 uncleared splits
2) more than 1 possible solution eg.
Reconciled balance 1000.00
Unreconciled amounts -10.00, 2500.00, -33.30, -10.00, -45.00

If auto-clear balance is 3456.70, there are now TWO possibilities
corresponding to the two distinct -10.00 splits. It'll abort. If auto-clear
balance is 3500.00 then it will succeed because there's only one solution
(the 2500.00 split).

I have a pending PR to optimize this functions, as well as plug some memory
leaks at https://github.com/Gnucash/gnucash/pull/797; I'll push to my beta
branch and hopefully some beta-testers will confirm that the function still
works well.

Beta testers may use my PR or the flatpak build from today onwards:
https://code.gnucash.org/builds/flatpak/christopherlam/beta/

C
___
gnucash-devel mailing list
gnucash-devel@gnucash.org
https://lists.gnucash.org/mailman/listinfo/gnucash-devel


Re: [GNC-dev] gnucash maint: Bug 797750 - SIGSEV in swig-engine.c

2020-10-11 Thread Geert Janssens
You're welcome :)

Op zaterdag 10 oktober 2020 22:31:14 CEST schreef Frank H. Ellenberger:
> Hi Geert.
> 
> OK, so we are talking about
> https://lists.gnucash.org/pipermail/gnucash-patches/2020-October/033716.html
> .
> 
> That branch was never intended to go online. I was working on htdocs'
> beta branch and that was displayed in eclipse's git history. But
> obvisious I have then selected push from gnucash's remotes. Then
> something was working, but no result appeared in  the history.
> 
> Fixed now.
> 
> Thanks for heads up!
> Frank
> 
> Am 10.10.20 um 16:15 schrieb Geert Janssens:
> > Op zaterdag 10 oktober 2020 16:13:24 CEST schreef Geert Janssens:
> >> Op vrijdag 15 mei 2020 21:37:35 CEST schreef John Ralls:
> >>> Updatedvia  https://github.com/Gnucash/gnucash/commit/659f785c
> >>> (commit)
> >>> 
> >>>   from  https://github.com/Gnucash/gnucash/commit/4e9990dd (commit)
> >>> 
> >>> commit 659f785cb81396412e503b4d8f5fe22ceb3f39df
> >>> Author: John Ralls 
> >>> Date:   Fri May 15 12:37:24 2020 -0700
> >>> 
> >>> Bug 797750 - SIGSEV in swig-engine.c
> >> 
> >> Frank,
> >> 
> >> Did you push intentionally to the github repo ? As far as I know we don't
> >> usually push feature branches there :)
> >> 
> >> Regards,
> >> 
> >> Geert
> > 
> > And this reply was to the wrong commit :(
> > 
> > It was meant in reaction to the creation of the i18ninspector branch.
> > 
> > Regards,
> > 
> > Geert
> > ___
> > gnucash-devel mailing list
> > gnucash-devel@gnucash.org
> > https://lists.gnucash.org/mailman/listinfo/gnucash-devel


___
gnucash-devel mailing list
gnucash-devel@gnucash.org
https://lists.gnucash.org/mailman/listinfo/gnucash-devel