>> I had a plan for a matching heuristic, but I think the bayesian
>> filter is a better idea.  Any hard coded heuristic will work well for
>> some people but fail completely for others.  A bayesian filter should
>> adapt to various systems with different data formats much better.
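For concreteness, the kind of adaptive, statistics-driven filter being proposed can be sketched as a tiny naive-Bayes classifier: keep word counts per category, then score a new item by summing log word probabilities. This is only an illustration, not what ifilter actually stores; the folder names and counts below are toy numbers.

```python
import math

# Hypothetical ifilter-style data store: per-folder word counts.
# All folder names and numbers here are made up for illustration.
counts = {
    "Spam":       {"hot": 100, "sex": 250, "qif": 1,  "sql": 2},
    "GnuCash":    {"qif": 40,  "sql": 8},
    "PostgreSQL": {"hot": 5,   "sql": 48},
    "Family":     {"hot": 10,  "sex": 1, "david": 45, "carla": 30},
}

def classify(words, counts, smoothing=1.0):
    """Pick the folder with the highest summed log word probability."""
    vocab = len({w for wc in counts.values() for w in wc})
    scores = {}
    for folder, wc in counts.items():
        total = sum(wc.values())
        score = 0.0
        for w in words:
            # Laplace smoothing keeps an unseen word from zeroing a folder out.
            p = (wc.get(w, 0) + smoothing) / (total + smoothing * vocab)
            score += math.log(p)
        scores[folder] = score
    return max(scores, key=scores.get)

print(classify(["david", "carla"], counts))  # "Family"
print(classify(["hot", "sex"], counts))      # "Spam"
```

Because the scores are just sums of logs, every word in the message contributes jointly, which is the "log-based weighting" property that makes this adapt to whatever vocabulary a given user's data happens to have.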
> I haven't looked at ifilter, so I don't know what kind of data-store
> it requires.  I still want to maintain a data store rather than trying
> to build it at runtime.  I just have no clue about what would need to
> be stored in such a database.

At first thought, ifilter sounded insane, but on second thought, it may
not be /too/ horribly bad.

What it does is store statistics on the incidence of particular words in
association with each "mail folder."  So, for email, we'd have a list of
folders:

 - Spam = 0, GnuCash = 1, PostgreSQL = 2, and Family = 3

Then there are words that occur in email.  We collect stats on words, in
association with the folders:

  Hot   --> Spam:100  GnuCash:0   PostgreSQL:5   Family:10
  Sex   --> Spam:250  GnuCash:0   PostgreSQL:0   Family:1
  David --> Spam:0    GnuCash:0   PostgreSQL:0   Family:45
  Carla --> Spam:0    GnuCash:0   PostgreSQL:0   Family:30
  QIF   --> Spam:1    GnuCash:40  PostgreSQL:0   Family:0
  SQL   --> Spam:2    GnuCash:8   PostgreSQL:48  Family:0

There would presumably be a whole lot more words than that.

We then look at a new message and see what words it contains.  For each
word, we compare the stats against each of the mail folders.  A message
that has "David" and "Carla" in it will likely show a strong association
with "Family", and none with any of the other folders.  One that has the
words "Hot" and "Sex" will likely have weak correlation with PostgreSQL
and Family, and show a strong association with Spam.  A log-based
weighting is done so that /all/ the words get taken jointly into
consideration, based on their various weights.

On the one hand, it seems an attractive idea to try to do something like
this with GnuCash; it should provide a nice way of adaptively having
incoming transactions categorize themselves to the "nearest" existing
transaction.

Three challenges/problems leap to mind:

1. Ifilter performance gets increasingly /greatly/ ugly as the number of
   categories grows.
   Doing this totally automagically means that each transaction is a
   "category," and if there are thousands of transactions, that's not
   terribly nice.

2. Related to 1: what if the transaction "looks nearly like" a dozen
   transactions?  Which do you choose?

3. What if a bunch of transactions are already very similar?  For
   instance, my monthly rent goes to the same payee for the same amount
   on the same day of each month.  These transactions are "essentially
   the same" for our purposes, and really should get collected together
   into one "category."

   It would sure be nice to collect them together; that cuts down on the
   number of categories, and means that rather than there being a bunch
   of "nearly similar" categories, one per transaction, there's just one
   category.  But how do we provide a "user interface" that allows them
   to be grouped that way?

Solve #3 and the other two fall into insignificance.

Jumping on to #5 (there is NO #4!)...

5. Of course, give me the ability to "memorize" a transaction and have
   it repeat each month, and I may not even /want/ to import such
   transactions anymore...  If I'm generating monthly rent transactions
   as scheduled transactions, and doing the same with various other
   transactions, loading data from the bank might become totally
   redundant...

-- 
(reverse (concatenate 'string "ac.notelrac.teneerf@" "454aa"))
http://cbbrowne.com/info/emacs.html
Frisbeetarianism: The belief that when you die, your soul goes up
on the roof and gets stuck...
_______________________________________________
gnucash-devel mailing list
[EMAIL PROTECTED]
http://www.gnucash.org/cgi-bin/mailman/listinfo/gnucash-devel
