Re: [GNC-dev] Is the import match map still required?
I created two enhancement issues on Bugzilla regarding this topic:

* https://bugs.gnucash.org/show_bug.cgi?id=797778
* https://bugs.gnucash.org/show_bug.cgi?id=797779
Re: [GNC-dev] Is the import match map still required?
David, thanks for your detailed explanations.

Implementing a procedure that can be run as needed and that updates the frequency table from the current transactions of an account seems a meaningful first step. This could then be used to measure performance, and afterwards it could be decided whether the procedure can also run on the fly.

I have also thought more about the user's side of interacting with the frequency table. The current situation feels a bit like "hacking" the frequency table to achieve better matching results: you can remove entries that seem wrong or that seem to corrupt the matching results. If the user did not change the frequency table directly, but could instead set personal preferences on how the data is used, those preferences would not be affected by running the update procedure. Regular updates of the frequency table would then reliably remove wrong or outdated entries and keep the data up to date. The user could, for example, exclude tokens from the Bayesian algorithm that are not relevant for him.

Christian
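A minimal sketch of that split, assuming a simple token -> account count table that is rebuilt mechanically while the user's exclusions are stored and applied separately (all names are illustrative, not GnuCash code):

    from collections import defaultdict

    def tokenize(description):
        return description.lower().split()

    def rebuild_freq_table(transactions):
        """Periodic rebuild: re-derive token -> account counts from the
        transactions currently in an account, so corrections made after an
        import and manually entered transactions are picked up."""
        freq = defaultdict(lambda: defaultdict(int))
        for description, transfer_account in transactions:
            for tok in tokenize(description):
                freq[tok][transfer_account] += 1
        return freq

    def best_account(freq, description, excluded_tokens=frozenset()):
        """Query time: the user's preferences (here just excluded tokens)
        are applied when the table is read, not stored in it, so a rebuild
        cannot wipe them."""
        scores = defaultdict(int)
        for tok in tokenize(description):
            if tok in excluded_tokens:
                continue
            for account, count in freq.get(tok, {}).items():
                scores[account] += count
        return max(scores, key=scores.get) if scores else None

The table stays disposable and can be regenerated at any time; the preferences live elsewhere and survive every rebuild.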
Re: [GNC-dev] Is the import match map still required?
On Sun, 24 May 2020 15:44:48 +1000, flywire wrote to gnucash-devel:

> The most obvious match would be to match any transfer accounts in the data to GnuCash accounts, even if the result needs to be verified. Other comments:
> 1) A user's rapid clicks can unintentionally select the wrong account, mapping invalid data.
> 2) There seems to be an opportunity for the user to re-run a process to recreate the map and prune the useless matches David refers to (dates, connectors (a, and, the etc.), transaction amounts?). With enough transactions this should be pretty good.
> 3) I assume the table is updated with merged accounts, ...
> 4) Assuming the match is case sensitive, should that be optionally turned off?

Hi flywire,

Re matching input file accounts to GnuCash accounts: I guess this would only apply to QIF or CSV imports, but it sounds like a good idea.

1) & 2) You can always run the Import Map Editor to delete bad matches. I've always thought it would be a good idea if there were a parameter for the minimum token length the Bayesian matching would consider, so I could get it to ignore data that in no way helps a correct match, like the date separators "/" and "-", or dd or mm or yy. It would also be useful, but a fair bit of work, to have a screen where you could enter string tokens to be ignored, like "Receipt", "September" etc.

3) What does "I assume the table is updated with merged accounts" mean? If you mean when you delete an account and elect to move all its transactions to another account, I've got no idea, but it would be easy enough to test. I'm not sure it would be worth the effort, as it is easy enough to build up mapping history again. The problem is to make the mapping history useful.

4) Possibly useful.

Regards, Chris Good
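The minimum-token-length parameter, the ignore list, and the case-sensitivity switch from 4) could all live in one small filter in front of the matcher. A sketch with made-up preference names (nothing here is an existing GnuCash setting):

    # Hypothetical preferences, not existing GnuCash settings.
    MIN_TOKEN_LEN = 3
    IGNORED_TOKENS = {"receipt", "september"}  # user-entered examples

    def filter_tokens(description, case_sensitive=False):
        """Drop tokens that in no way help a correct match: separators like
        "/" and "-", two-letter date fragments (dd, mm, yy), and anything
        on the user's ignore list."""
        text = description if case_sensitive else description.lower()
        return [t for t in text.split()
                if len(t) >= MIN_TOKEN_LEN and t not in IGNORED_TOKENS]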
Re: [GNC-dev] Is the import match map still required?
Christian,

I haven't experimented to know whether constructing the frequency table on the fly creates a performance bottleneck or not, but I am guessing the original developer thought it might. It would require a detailed look at the code involved, but my suspicion is that the performance penalty is likely to be significant.

My comment about bloat is that at present data is only maintained for accounts you specifically import data into, and only if that data is stored. If it isn't stored, then bloat obviously doesn't apply. Any sort of generalized procedure could allow selection of the accounts for which Bayesian matching is required, i.e. those for which importing is used to input data. My initial thought was that you would run it for all accounts, but it is really only necessary for the specific subset of accounts into which you import data. It would then require the ability to run the procedure on an account if it occurred in import data but didn't have existing account matching data. If it is on the fly, that's no problem: it can run whenever a new account being imported into appears in the imported data. The most common use case is probably importing data into one specific account, but GnuCash also allows the import data itself to specify the account being imported into.

I haven't looked at how the frequency table is currently stored in memory, but I am guessing it is constructed in memory when the data file is read in. The up-to-date aspect is one advantage, and if the current procedure is changed to improve performance, that is not hampered by the presence of historical data, which would be updated automatically when the procedure is run. If the table is stored as it is at present and a procedure were available to trawl the current transactions for an account, the table could be kept up to date by running that procedure periodically. The use of data from manually entered transactions would then be incorporated, whether on the fly or just run as required.

Having a standalone procedure to trawl an existing file and update the stored data for an account would allow exploration of whether this is likely to be a significant performance hit if run on the fly, so that could perhaps be a first step. The core part of the code to store the data has to exist in the matcher code already, and it would be a case of wrapping this in a loop through the transactions existing in an account and setting up the GUI interface to select the accounts to run it on.

The problem with pruning the data is that GnuCash has no way of knowing a priori which tokens are most relevant. I would think that date information is not really relevant, and amount/value information does little in most cases to identify a transfer account. The main difficulty I have with transfer account assignment is that some regular transactions use a unique code in the description each time they occur, with no separate unique identifier of the transaction source. My wife and I both have separate gym membership subscriptions, and the transaction descriptions identify neither the gym nor which of us the transaction applies to. The options are to persuade the source to include specific data or to use a single account to record both, but I like to track both our individual and joint expenses.

Some regular transactions also get matched to previous payments in the transaction matching within the date range window, where the amounts and descriptions are usually identical. The current 42 day window captures both fortnightly and monthly regular income transactions, for example. This only affects a few transactions each month, and I don't have huge numbers of transactions to process now that I have retired, but that may not be the case for other users. Maybe making the date range window adjustable rather than fixed might be a cure for this. Setting it at <14 days would cure the problems I have, for example, but that again would not work for everybody.

I am currently committed to a bit of work on the documentation front, so I will be unlikely to consider this in the near future in other than general terms, but someone else may be willing to take it up.

David
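The adjustable window would be a small change at the comparison itself. A sketch, with window_days standing in for a hypothetical preference:

    from datetime import timedelta

    def within_match_window(import_date, existing_date, window_days=42):
        """True if an existing transaction is close enough in time to be a
        candidate match. With the default 42 days a fortnightly payment can
        match last month's identical entry; window_days < 14 would not."""
        return abs(import_date - existing_date) <= timedelta(days=window_days)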
Re: [GNC-dev] Is the import match map still required?
On 24.05.20 at 01:52, David Cousens wrote:

> Christian, I guess it depends on whether there is a performance advantage in using the previously stored data for the transfer account associations over constructing the frequency table on the fly. The search for matching transactions only takes place within a narrow time window around the date of import, so it is unlikely to canvass enough transactions to construct a valid frequency table from tokenized data within that window. The stored frequency table would generally contain data from a much wider range of transactions and would take much longer to construct on the fly each time it was needed.

I'm only thinking about account matching (Bayesian matching), not transaction matching. For this it would of course be necessary to work with all historical data, not only with a few transactions within a narrow time window. Can you tell whether it would be a considerable performance load to construct the frequency table on the fly from all historical transactions related to a transfer account?

> I have also pondered whether it could be usefully augmented by using data from transactions entered manually which have not been imported for the file associations. It could be of value where you have a good set of historical records, but it would only need a one-off run through the existing transactions to gather the data. Unless you confined it to running on a specific set of accounts to which you import data, it might cause bloat of the data file with unnecessary and unused information.

A possible advantage of constructing the frequency table on the fly would be that it is always up to date. If the user sets the "wrong" other account during import, for instance, and corrects it after the import, the import match map currently keeps the wrong matching information and is not corrected after the import. Manually entered transactions would also be considered, right. A one-off manual run through all transactions to update the import match map could be a good alternative to constructing it on the fly. Sounds good.

Why do you think a run through all transactions "might cause bloat of the data file"? The current import match map also contains all of the possibly unused or unnecessary data from all matched accounts. I still assume in this case that the import match map is related to one transfer account only, which already limits the set of accounts from which it is constructed.

> I have examined the stored data in my data file with the import map editor and found that a lot of data was stored which contributes little to the matching for the transfer account (dates, connectors (a, and, the etc.), transaction amounts?) and which often has a fairly uniform frequency across all accounts which were used as transfer accounts. After a bit of pruning of the stored data my matching reliability seemed to improve a bit.

Ok, I see. If the import match map has to be pruned to get reliable results from the Bayesian matching algorithm, a frequency table which is constructed on the fly or rebuilt in a one-off run is a big disadvantage. If it is constructed on the fly, nothing can be pruned. And if it is rebuilt, all pruned data will be back after the run.
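A first-order way to probe the performance question above, short of profiling the importer itself, is to time a full on-the-fly rebuild over synthetic data. A rough, self-contained sketch:

    import random
    import time
    from collections import defaultdict

    WORDS = ["gym", "market", "fee", "transfer", "monthly", "acme", "shop"]

    def build_on_the_fly(transactions):
        # token -> account -> count, rebuilt from scratch
        freq = defaultdict(lambda: defaultdict(int))
        for description, account in transactions:
            for tok in description.split():
                freq[tok][account] += 1
        return freq

    # 100k synthetic five-word descriptions spread over 20 accounts
    txns = [(" ".join(random.choices(WORDS, k=5)),
             "Expenses:%d" % random.randint(0, 19))
            for _ in range(100_000)]
    start = time.perf_counter()
    build_on_the_fly(txns)
    print("full rebuild over %d transactions: %.3fs"
          % (len(txns), time.perf_counter() - start))

Real numbers would of course depend on how GnuCash iterates over transactions, but this bounds the cost of the table construction itself.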
> I don't know at the moment if the tokens stored for transfer account matching are a subset of the tokens used for transaction matching (I haven't checked), but restricting the set of tokens used may possibly improve performance and reduce the amount of data stored, if all tokens associated with a transaction are currently being stored in the frequency table, which is what I suspect from examining my import map data.

Yes, this is the current situation: every token is stored. Do you have suggestions for how tokens could be automatically pruned in a meaningful way?
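One candidate heuristic follows from David's observation that low-value tokens have a fairly uniform frequency across transfer accounts: keep a token only if one account clearly dominates its counts. A sketch, assuming the token -> account count table from the earlier sketches (the threshold is illustrative, not a tested value):

    def prune_uniform_tokens(freq, min_top_share=0.5):
        """Keep a token only if its most frequent transfer account owns at
        least min_top_share of the token's total count. Near-uniform tokens
        (dates, connectors, amounts) fall below that and are dropped; a
        token seen in only one account scores 1.0 and is always kept."""
        pruned = {}
        for tok, per_account in freq.items():
            total = sum(per_account.values())
            if max(per_account.values()) / total >= min_top_share:
                pruned[tok] = per_account
        return pruned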
Re: [GNC-dev] Is the import match map still required?
Christian,

I guess it depends on whether there is a performance advantage in using the previously stored data for the transfer account associations over constructing the frequency table on the fly. The search for matching transactions only takes place within a narrow time window around the date of import, so it is unlikely to canvass enough transactions to construct a valid frequency table from tokenized data within that window. The stored frequency table would generally contain data from a much wider range of transactions and would take much longer to construct on the fly each time it was needed.

I have also pondered whether it could be usefully augmented by using data from transactions entered manually which have not been imported for the file associations. It could be of value where you have a good set of historical records, but it would only need a one-off run through the existing transactions to gather the data. Unless you confined it to running on a specific set of accounts to which you import data, it might cause bloat of the data file with unnecessary and unused information.

I have examined the stored data in my data file with the import map editor and found that a lot of data was stored which contributes little to the matching for the transfer account (dates, connectors (a, and, the etc.), transaction amounts?) and which often has a fairly uniform frequency across all accounts which were used as transfer accounts. After a bit of pruning of the stored data my matching reliability seemed to improve a bit.

I don't know at the moment if the tokens stored for transfer account matching are a subset of the tokens used for transaction matching (I haven't checked), but restricting the set of tokens used may possibly improve performance and reduce the amount of data stored, if all tokens associated with a transaction are currently being stored in the frequency table, which is what I suspect from examining my import map data.

David Cousens
[GNC-dev] Is the import match map still required?
Hi devs,

I have been studying the Bayesian import matching algorithm quite intensively for a while now. There is one question I have often asked myself and which I want to ask you now. The same question was also asked on the German mailing list a few days ago.

Is it really necessary to work with a separate "import match map" instead of the imported transaction data directly? The import match map contains matching information from earlier imports, but don't the imported transactions contain the same information? Instead of querying token occurrences in this import match map, for instance, couldn't the Bayesian algorithm get the same information by querying the already imported transactions? Is this a performance consideration, or are there further reasons?

Regards,
Christian
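To make the question concrete: conceptually the match map is token -> account counts accumulated at import time, and the same counts could in principle be derived from the stored transactions on demand. A sketch of the two routes in plain Python (nothing GnuCash-specific):

    from collections import defaultdict

    # Route 1: what the importer effectively persists after each confirmed match.
    match_map = defaultdict(lambda: defaultdict(int))

    def on_match_confirmed(description, account):
        for tok in description.lower().split():
            match_map[tok][account] += 1

    # Route 2: the same table, derived directly from stored transactions.
    def derive_match_map(transactions):
        derived = defaultdict(lambda: defaultdict(int))
        for description, account in transactions:
            for tok in description.lower().split():
                derived[tok][account] += 1
        return derived

Whether the second route is affordable at import time is exactly the performance question the replies in this thread turn on.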