Re: [GNC-dev] avoid the brain dead import

2018-08-30 Thread Derek Atkins
Dear Wm,

On Thu, August 30, 2018 10:10 am, Wm via gnucash-devel wrote:
> On 29/08/2018 23:52, David Cousens wrote:
>
>> I think the decision about whether to import a small number of
>> transactions by hand is really one for the user and not the importer to
>> make. I would import small batches, maybe 20-30 to test the importer
>> function and ensure it was working as expected before attempting to
>> import 10k.
>
> You are missing the point entirely.
>
> The importer compares the tx being imported against *every* extant tx.

The importer compares against existing transactions to detect duplicates. 
This is done because there is absolutely no guarantee that the user won't
import the same transaction multiple times.  This can happen by accident
(importing the same file multiple times), or it could happen because the
data source provides the same data multiple times (e.g., some banks will
provide overlapping downloads).
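
In outline, the check looks something like the following minimal sketch
(not GnuCash's actual code; the Tx type and the function names are made
up for illustration):

    #include <stddef.h>
    #include <string.h>
    #include <time.h>

    /* Hypothetical, simplified transaction record -- not GnuCash's
     * real Transaction type. */
    typedef struct {
        time_t post_date;
        long   amount_cents;
        char   description[256];
    } Tx;

    /* A rough duplicate heuristic: same amount, posted within a few
     * days, same description.  The real matcher scores candidates
     * rather than taking the first hit, but the shape is the same. */
    static int looks_like_duplicate(const Tx *incoming, const Tx *existing)
    {
        double days = difftime(existing->post_date,
                               incoming->post_date) / 86400.0;
        if (days < -3.0 || days > 3.0)
            return 0;
        if (incoming->amount_cents != existing->amount_cents)
            return 0;
        return strcmp(incoming->description, existing->description) == 0;
    }

    /* With no index, every incoming tx is checked against every
     * existing tx: O(incoming x existing) comparisons. */
    static const Tx *find_duplicate(const Tx *incoming,
                                    const Tx *book, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (looks_like_duplicate(incoming, &book[i]))
                return &book[i];
        return NULL;
    }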

It has to search every existing transaction because there is no way in the
underlying code not to do that.  Theoretically, you should only need to
search through transactions within a relatively short time frame (say, +/-
2-3 weeks).  However, there is no way to do this.  Even when you create a
QofSearch with a limited date range, it will *still* iterate through every
existing transaction.  Of course, if the date is not in range the transaction
will get thrown out, but by that point the damage has been done.
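
Concretely, a date-limited query today behaves like the sketch below
(continuing the hypothetical Tx type above): the date predicate decides
what is returned, not what is visited.

    /* The filter is applied per element, so the walk is O(n) no
     * matter how narrow the date range is. */
    static size_t query_by_date(const Tx *book, size_t n,
                                time_t start, time_t end,
                                const Tx **out, size_t max_out)
    {
        size_t found = 0;
        for (size_t i = 0; i < n; i++) {       /* visits every tx */
            if (book[i].post_date < start || book[i].post_date > end)
                continue;                      /* rejected, but too late */
            if (found < max_out)
                out[found++] = &book[i];
        }
        return found;
    }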

This issue will only get fixed when we can move GnuCash to be a true DB
app.  Then the SQL code can truly limit the search space properly.
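
For illustration, the same search against a SQL backend could look like
this sketch (SQLite here for brevity; the table and column names are
hypothetical, not GnuCash's actual schema).  With an index on the
posting date, the database only touches rows inside the range:

    #include <sqlite3.h>
    #include <stdio.h>

    /* Assumes: CREATE INDEX idx_post_date ON transactions (post_date); */
    static int print_tx_in_range(sqlite3 *db, sqlite3_int64 start,
                                 sqlite3_int64 end)
    {
        const char *sql =
            "SELECT description FROM transactions "
            "WHERE post_date BETWEEN ?1 AND ?2;";
        sqlite3_stmt *stmt;

        if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK)
            return -1;
        sqlite3_bind_int64(stmt, 1, start);
        sqlite3_bind_int64(stmt, 2, end);
        while (sqlite3_step(stmt) == SQLITE_ROW)
            printf("%s\n", (const char *)sqlite3_column_text(stmt, 0));
        sqlite3_finalize(stmt);
        return 0;
    }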

Hopefully this explains what's going on.

-derek

-- 
   Derek Atkins 617-623-3745
   de...@ihtfp.com www.ihtfp.com
   Computer and Internet Security Consultant



Re: [GNC-dev] avoid the brain dead import

2018-08-30 Thread Wm via gnucash-devel

On 29/08/2018 23:52, David Cousens wrote:


I think the decision about whether to import a small number of
transactions by hand is really one for the user and not the importer to
make. I would import small batches, maybe 20-30 to test the importer
function and ensure it was working as expected before attempting to
import 10k.


You are missing the point entirely.

The importer compares the tx being imported against *every* extant tx.

Read that twice, please.





Re: [GNC-dev] avoid the brain dead import

2018-08-29 Thread David Cousens
William,

I have experienced the importer trying to match data out of the range
of dates in the current import. From memory, it only occurred when I
first changed over to version 3.0. The matcher appeared to have lost
all memory of what accounts to assign in the changeover from 2.6.
However, I found that after importing 1-2 months' data it was
functioning normally again. I have been using the OFX importer for 3-4
years without any significant problems.

Your point about large data files sounds valid. I haven't looked at the
code for the match picker, so I don't know how it works or whether it
works on the historical data to extract the information it needs to
make a choice of accounts to assign or data to match.

As it is a Bayesian mechanism, at some point it has to examine the
existing data and construct some sort of probability table, so my guess
would be that this is the step which is taking so long. Being able
to set a preference for a date range or period to use in constructing
the initial probability tables is probably a good idea if this is the
case.
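
If that guess is right, the core of such a table is easy to picture.  A
toy sketch (all types and names invented for illustration; GnuCash's
real matcher stores its token counts differently):

    #include <string.h>

    #define MAX_ACCOUNTS 16
    #define MAX_TOKENS   256

    /* One row per description token, counting how often the token has
     * been seen together with each candidate transfer account. */
    typedef struct {
        char token[32];
        int  count[MAX_ACCOUNTS];
    } TokenRow;

    typedef struct {
        TokenRow rows[MAX_TOKENS];
        int      n_rows;
    } TokenTable;

    /* Training step: record that `token` appeared in a transaction
     * that the user assigned to `account`. */
    static void learn(TokenTable *t, const char *token, int account)
    {
        for (int i = 0; i < t->n_rows; i++) {
            if (strcmp(t->rows[i].token, token) == 0) {
                t->rows[i].count[account]++;
                return;
            }
        }
        if (t->n_rows < MAX_TOKENS) {
            TokenRow *r = &t->rows[t->n_rows++];
            memset(r, 0, sizeof *r);
            strncpy(r->token, token, sizeof r->token - 1);
            r->count[account] = 1;
        }
    }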

My experience on the changeover from 2.6 to 3.0, when it appeared to
have lost any memory of previous import assignments, indicated that the
importer was constructing those tables from the data it imports and not
from the historical data, but I could be wrong.  I would expect it to
be using a Kalman filtering approach on the input data but can't be
sure until I get a good look at the code. It did initially attempt to
match transactions that were otherwise similar to transactions in the
previous month or two. I only have data going back ~8 years and have
been retired for a large percentage of that, so my files aren't huge,
and I may not be hitting your problem if it does look further back.

I think the decision about whether to import a small number of
transactions by hand is really one for the user and not the importer to
make. I would import small batches, maybe 20-30 to test the importer
function and ensure it was working as expected before attempting to
import 10k.


On Wed, 2018-08-29 at 22:00 +0100, Wm via gnucash-devel wrote:
> On 25/08/2018 07:22, David Cousens wrote:
> 
> i thank David for his posting, which i have read; i don't address all
> he said
> 
> > Keep trying. The brain dead importer does get less brain dead with
> > repeated use.
> 
> i'm not sure it does get better as implemented because 2 of the bits of 
> brain dead-ity are
> 
> 1. the universe against which the importer is comparing imported tx is 
> going to be growing, so as a strategy it is doomed to sluggishness and 
> eventually not being used unless there is some limit to the universe 
> (week / month / quarter / year / decade)
> 
> 2. unless there is something better users are going to try and use it 
> and become more frustrated and stop using it.
> 
> 
> 
> fairly easy to think about ways of fixing 1., like "do you want the 
> importer to really, really, really compare the imported tx against your 
> stuff from the 1980s?  y/N"  at the moment this is defaulting to Y 
> without asking and I don't think that makes sense.
> 
> I mean, think of inflation: why would one of anything in 2018 be 
> sensibly matched against the same thing 30 years ago?
> 
> There isn't even the opportunity to time limit the universe and some 
> folk have stuff going back much longer than me and have many more tx 
> than me.
> 
> fixing 2. just involves some thought about the user, almost no 
> programming.  Redundant questions for the user would be, "you are 
> importing 3 tx, you have 10K tx in your file, this could take fucking 
> hours, do you want to continue or just type them in by hand?  if you 
> want my advice by hand is quicker"
> 
> See?  the importer has no idea of scale, 3 tx incoming?  I'll do it by 
> hand.


Re: [GNC-dev] avoid the brain dead import

2018-08-29 Thread Wm via gnucash-devel

On 25/08/2018 07:22, David Cousens wrote:

i thank David for his posting, which i have read; i don't address all he said


Keep trying. The brain dead importer does get less brain dead with repeated
use.


i'm not sure it does get better as implemented because 2 of the bits of 
brain dead-ity are


1. the universe against which the importer is comparing imported tx is 
going to be growing, so as a strategy it is doomed to sluggishness and 
eventually not being used unless there is some limit to the universe 
(week / month / quarter / year / decade)


2. unless there is something better users are going to try and use it 
and become more frustrated and stop using it.




fairly easy to think about ways of fixing 1., like "do you want the 
importer to really, really, really compare the imported tx against your 
stuff from the 1980s?  y/N"  at the moment this is defaulting to Y 
without asking and I don't think that makes sense.


I mean, think of inflation: why would one of anything in 2018 be 
sensibly matched against the same thing 30 years ago?


There isn't even the opportunity to time limit the universe and some 
folk have stuff going back much longer than me and have many more tx 
than me.


fixing 2. just involves some thought about the user, almost no 
programming.  Redundant questions for the user would be, "you are 
importing 3 tx, you have 10K tx in your file, this could take fucking 
hours, do you want to continue or just type them in by hand?  if you 
want my advice by hand is quicker"


See?  the importer has no idea of scale, 3 tx incoming?  I'll do it by 
hand.








Re: [GNC-dev] avoid the brain dead import

2018-08-25 Thread David Cousens
William,

I think the answer to your question lies in the fact that the files users
wish to import don't come from a single source and don't always conform to
any well-defined standard with regard to both the data format and the
information supplied.

Importing OFX data is considerably more straightforward than importing CSV
data for this reason, as it does conform to a reasonably well-defined
standard, but even then some institutions do manage to stuff it up. In most
cases users don't necessarily have any control over what another institution
includes in the files they supply. Most include transactions between the To
and From dates inclusively that you might enter when requesting a data
download, but this is not guaranteed. Stupid? Yes, but the importer has to
cope with stupid, as well as with nicely formatted and well-thought-out
files. Not all data for a bank account includes the detail of which account
you may want the second split of a transaction to go to, and even if it
does, it may not match your choices in setting up your chart of accounts.
If it does, then GnuCash deals with that.

I am an accountant (retired) and I have imported the same file (or at least
overlapping data in different files) on more than one occasion since I have
been using GnuCash. The point of the matcher is to pick this up before you
have entered the data into your accounts, sparing you the far more laborious
task of working out which transactions were duplicated in an import and
deleting them from your records one by one once they have been imported. If
you get the date format wrong relative to your locale format on an import,
it can be particularly difficult. Swapping days and years produces some
interesting results.
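
For instance, the same field can parse to two different dates depending
on the assumed format; a minimal illustration:

    #define _XOPEN_SOURCE 700  /* for strptime */
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        const char *field = "03/04/05";
        struct tm dmy = {0}, ymd = {0};

        strptime(field, "%d/%m/%y", &dmy);   /* 3 April 2005 */
        strptime(field, "%y/%m/%d", &ymd);   /* 5 April 2003 -- day/year swap */

        printf("d/m/y reading: %04d-%02d-%02d\n",
               dmy.tm_year + 1900, dmy.tm_mon + 1, dmy.tm_mday);
        printf("y/m/d reading: %04d-%02d-%02d\n",
               ymd.tm_year + 1900, ymd.tm_mon + 1, ymd.tm_mday);
        return 0;
    }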

The matcher also has a Bayesian learning system which can allocate the
transfer account for the second split on the basis of matching information
in the description and other fields. My experience has been that after I
have imported one or two months' data, it will generally assign the transfer
account for about 60% of the data in the succeeding months, handles regular
payments and deposits pretty well, and gets better still after a few
months.
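
The assignment step is then just a matter of scoring each account
against the tokens of the incoming description.  Continuing the toy
TokenTable sketch above (the real matcher normalises counts into
probabilities and applies a confidence threshold):

    /* Scoring step: pick the account whose recorded token counts best
     * explain the incoming description; -1 means "no match at all". */
    static int guess_account(const TokenTable *t,
                             const char *tokens[], int n_tokens,
                             int n_accounts)
    {
        int best = -1, best_score = 0;

        for (int a = 0; a < n_accounts; a++) {
            int score = 0;
            for (int k = 0; k < n_tokens; k++)
                for (int i = 0; i < t->n_rows; i++)
                    if (strcmp(t->rows[i].token, tokens[k]) == 0)
                        score += t->rows[i].count[a];
            if (score > best_score) {
                best_score = score;
                best = a;
            }
        }
        return best;
    }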

I import a few hundred transactions a month, generally in 5-10 minutes, from
OFX files with no problem. CSV importing (e.g., PayPal) can be far more
problematic, but the ability of the importer in v3.2 to save import
settings is a great help.

There is a recent patch (Bug 796778) which might help you shorten the
initial input before the matcher works efficiently, but it is not yet
incorporated in the master branch. It implements multiple selection of rows
in the matcher (e.g., from the same vendor) using Ctrl-click, Shift-click
and the rubber-banding techniques implemented in GTK, and the assignment of
those rows to a single transfer account. It speeds up the initial import of
data quite a bit but is less effective once the Bayesian matching is trained
(which is possibly why it has not been implemented before now), as that
tends to pick up repeated transactions fairly well.
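
The machinery behind that, as described, is standard GTK; a sketch of
enabling it on a tree view (not the actual patch):

    #include <gtk/gtk.h>

    static void enable_multi_select(GtkTreeView *view)
    {
        GtkTreeSelection *sel = gtk_tree_view_get_selection(view);

        /* Ctrl-click and Shift-click range selection */
        gtk_tree_selection_set_mode(sel, GTK_SELECTION_MULTIPLE);

        /* click-and-drag rubber-band selection */
        gtk_tree_view_set_rubber_banding(view, TRUE);
    }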

The downside is, of course, that there is always a transaction or two from
the same vendor or customer which may have to go to a different transfer
account, i.e., you still have to check that it has been correctly assigned
by the matcher.

Keep trying. The brain dead importer does get less brain dead with repeated
use.

David Cousens


