> On Monday 10 February 2003 05:59 pm, Christopher Browne wrote:
> > >> I had a plan for a matching heuristic, but I think the bayesian
> > >> filter is a better idea. Any hard coded heuristic will work well for
> > >> some people but fail completely for others. A bayesian filter should
> > >> adapt to various systems with different data formats much better.
> > >
> > > I haven't looked at ifilter, so I don't know what kind of data-store
> > > it requires. I still want to maintain a data store rather than trying
> > > to build it at runtime. I just have no clue about what would need to
> > > be stored in such a database.
> >
> > At first thought, ifilter sounded insane, but at second thought, it may
> > not be /too/ horribly bad.
> >
> > What it does is to store statistics on the incidences of particular
> > words in association with each "mail folder."
> >
> > So, for email, we'd have a list of folders:
> >
> >   - Spam = 0, GnuCash = 1, PostgreSQL = 2, and Family = 3
> >
> > Then there are words that occur in email. We collect stats on words, in
> > association with the folders.
> >
> >   Hot   --> Spam:100  GnuCash:0   PostgreSQL:5   Family:10
> >   Sex   --> Spam:250  GnuCash:0   PostgreSQL:0   Family:1
> >   David --> Spam:0    GnuCash:0   PostgreSQL:0   Family:45
> >   Carla --> Spam:0    GnuCash:0   PostgreSQL:0   Family:30
> >   QIF   --> Spam:1    GnuCash:40  PostgreSQL:0   Family:0
> >   SQL   --> Spam:2    GnuCash:8   PostgreSQL:48  Family:0
> >
> > There would presumably be a whole lot more words than that.
> >
> > We then look at a new message, and look at what words it has in it. For
> > each word, compare the stats with each of the message folders.
> >
> > One that has "David" and "Carla" in it will likely show a strong
> > association with "Family", and none with any of the other folders.
> >
> > One that has the words "Hot" and "Sex" will likely have weak correlation
> > with PostgreSQL and Family, and show strong association with Spam.
> >
> > There's a log-based weighting done so that /all/ the words get taken
> > jointly into consideration based on their various weightings.
> >
> > On the one hand, it seems an attractive idea to try to do something like
> > this with GnuCash; this should provide a nice way of adaptively having
> > incoming transactions categorize themselves to the "nearest" existing
> > transaction.
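To make the "log-based weighting" above concrete, here is a rough sketch of
the general shape of the scoring. It is illustrative only, not ifilter's
actual code; the counts come from the toy table above, and the +1 smoothing
constant is an assumption of the sketch.

# Illustrative sketch only; not ifilter's real implementation.
import math

# Per-word counts per category, taken from the toy table above.
counts = {
    "hot":   {"Spam": 100, "GnuCash": 0,  "PostgreSQL": 5,  "Family": 10},
    "sex":   {"Spam": 250, "GnuCash": 0,  "PostgreSQL": 0,  "Family": 1},
    "david": {"Spam": 0,   "GnuCash": 0,  "PostgreSQL": 0,  "Family": 45},
    "carla": {"Spam": 0,   "GnuCash": 0,  "PostgreSQL": 0,  "Family": 30},
    "qif":   {"Spam": 1,   "GnuCash": 40, "PostgreSQL": 0,  "Family": 0},
    "sql":   {"Spam": 2,   "GnuCash": 8,  "PostgreSQL": 48, "Family": 0},
}
categories = ["Spam", "GnuCash", "PostgreSQL", "Family"]

def classify(tokens):
    """Sum log-probabilities over all tokens, so every word contributes
    jointly instead of any single word deciding the outcome."""
    totals = {c: sum(counts[w][c] for w in counts) for c in categories}
    scores = {c: 0.0 for c in categories}
    for c in categories:
        for w in tokens:
            if w not in counts:
                continue                  # unseen words carry no evidence
            # +1 smoothing keeps a zero count from vetoing a category
            p = (counts[w][c] + 1.0) / (totals[c] + len(counts))
            scores[c] += math.log(p)
    return max(scores, key=scores.get)

print(classify(["david", "carla"]))       # -> Family
print(classify(["hot", "sex"]))           # -> Spam

The same shape carries over to transactions: tokens drawn from fields like
the payee and amount, with destination accounts in place of mail folders.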
> Yep, this is the plan based upon Mr. Graham's webpage on spam filtering.

No, it is most certainly not. It is based on /my/ web page on spam
filtering, which was in place several years before Mr. Graham ever
considered the idea. I presented a talk on this back in the mid-90s, and
the code I'm using has been pretty solid since about 1997.

> > Three challenges/problems leap to mind:
> >
> > 1. Ifilter performance gets increasingly /greatly/ ugly as the number
> > of categories grows. Doing this totally automagically means that each
> > transaction is a "category," and if there are thousands of transactions,
> > that's not terribly nice.
> >
> > 2. Related to 1, what if the transaction "looks nearly like" a dozen
> > transactions? Which do you choose?

> The match isn't based on a single transaction but rather the sum of the
> transactions in any given destination account. A patch should be
> forthcoming in a few days that implements the whole algorithm.

Including the logarithmic normalization associated with Bayesian Filtering?
(Graham's method oversimplifies it...)
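To illustrate the difference (on the assumption that "logarithmic
normalization" here means summing per-token log-likelihoods rather than
taking the raw products in Graham's published combining formula), here is a
toy comparison; the probabilities below are invented for the example.

# Toy comparison only; the probabilities are made-up numbers.
import math

token_probs = [0.99, 0.90, 0.20, 0.05]    # hypothetical P(category | token)

# Graham-style combination: the product formula from "A Plan for Spam".
prod_p = math.prod(token_probs)
prod_q = math.prod(1 - p for p in token_probs)
graham = prod_p / (prod_p + prod_q)

# Log-space combination: sum the log-likelihoods for and against, then
# normalize. Equivalent for these four tokens, but it stays numerically
# stable when hundreds of tokens would drive the raw products to zero.
log_p = sum(math.log(p) for p in token_probs)
log_q = sum(math.log(1 - p) for p in token_probs)
log_space = 1.0 / (1.0 + math.exp(log_q - log_p))

print(round(graham, 4), round(log_space, 4))   # both print 0.9214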
> > 3. What if a bunch of transactions are already very similar? For
> > instance, my monthly rent goes in to the same payee for the same amount
> > on the same day of each month. These transactions are 'essentially the
> > same' for our purposes, and really should get collected together into
> > one "category." It would sure be nice to collect them together; that
> > cuts down on the number of categories, and means that rather than there
> > being a bunch of "nearly similar" categories, one for each transaction,
> > that there's just one category.

> There is nothing that can be done with these transactions as they contain
> little or no information regarding what they are for. Unfortunately they
> must be sorted with what little information they provide.

Actually, since the point is to figure out the destination category, the
fact that the similar transactions /are/ in the same category means that
they will all strengthen the association.

> > But how do we provide a "user interface" that allows them to be so
> > grouped???
> >
> > Solve #3 and the other two fall into insignificance.
> >
> > Jumping on to #5 (there is NO #4!)...
> >
> > 5. Of course, give me the ability to "memorize" a transaction and have
> > it repeat each month and I may not even /want/ to import such
> > transactions anymore...

> What does this have to do with importing transactions? ;-)

Someone was complaining today at the office about the fact that their bank
only supplied OFX files, and that they weren't sure what would load it. I
suggested that scheduled transactions would be pretty helpful in handling
this even if the GnuCash OFX support /didn't/ exist...

There's certainly more than one way to skin a cat...
--
(reverse (concatenate 'string "gro.gultn@" "enworbbc"))
http://www.ntlug.org/~cbbrowne/ifilter.html
Whatever you do don't mail me at [EMAIL PROTECTED], because then I'll know
you're just an address-harvester, and blacklist your IP until the end of time