Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-23 Thread Wm via gnucash-devel


On 02/02/2019 23:05, David Cousens wrote:


> I don't since I retired a few years ago, but I did for 8 years prior to
> retiring (and I used MYOB for the 10 years prior to that before escaping). I
> am certainly not alone. You could have a proviso that the script won't work
> for files using the business functions but that then detracts considerably
> from its usefulness as a general diagnostic tool.


I'm respecting you more as we progress, DavidC.

The broad point is that a normalization is without opinion or value.

No person would know if you had run your business successfully or not.


The fear is "the government will know I earned 20AUD on a contract and I 
didn't report".


struth is your government has much larger issues to deal with, ask them 
to pay attention to that.  That is, if you can manage one government 
for more than 3 fucking months at a time!


---

Point: MYOB is respected in Oz, Liz says so, it must be true.  Rest of 
the world doesn't give a flying fuck about whether it is a good double 
accounting prog or not.


---


> Sqlite itself and its availability on Linux is not really an issue. Most
> distros have it in their software repositories. What may be more of an issue
> is that a lot of people who don't use the database backends because they
> don't want the additional hassles of learning to use and maintain databases
> may be reluctant to install it.


True, I think this is also a red herring, most people are using Windows 
and SQLite comes with gnc for free.


Shouldn't you be asking why more people aren't using what they already have?

> I'm retired.


Disagree, your mind is still active :)


> Taking an extra half day to learn something
> new doesn't worry me as long as it happens before my time is up. But if I am
> running a busy life and/or a business as I used to, I would be more
> reluctant. Again not a show stopper, only a limitation on general
> applicability.
>
> David Cousens


Have a hug.
--
Wm

___
gnucash-devel mailing list
gnucash-devel@gnucash.org
https://lists.gnucash.org/mailman/listinfo/gnucash-devel

Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-23 Thread Wm via gnucash-devel


On 03/02/2019 04:10, David Carlson wrote:

> OK, I want to try https://wiki.gnucash.org/wiki/ObfuscateScript but I am
> not a computer programmer.  I have no clue how to use it.  Can someone help
> me?


It is Perl; if you have Finance::Quote (F::Q) working you probably have enough kit to run it.

--
Wm



Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-11 Thread Wm via gnucash-devel


On 03/02/2019 16:03, John Ralls wrote:




> On Feb 2, 2019, at 8:10 PM, David Carlson wrote:
>
> > OK, I want to try https://wiki.gnucash.org/wiki/ObfuscateScript but I am
> > not a computer programmer.  I have no clue how to use it.  Can someone help
> > me?
>
> Run it from a command line using perl, assuming here that you have Strawberry
> installed on C:
>
>    c:\strawberry\perl\bin\perl.exe ObfuscateScript path/to/myfile.gnucash
>
> Note that it rewrites the file in place, so make a copy and run it on that. The
> file needs to be uncompressed.


Apart from the write in place I quite like it as an idea to progress 
thought.


Positive: it is in perl which (many|most) people may have a working 
version of if they are using F::Q


Negative: it doesn't reconcile well, but this may actually be a positive 
because ...


Positive: if the script breaks some splits this should be seen as a good 
thing by some, it makes the work of the super secret agents running gnc 
harder.


Thinking aloud: another way of normalizing would be to split to some 
point beyond usefulness and let gnc put it back together again using

Actions / Check & Repair

===

Remember flox, the idea is a file that someone else (who probably didn't 
vote for the idiot Trump) could look at to see *your* problem.


Does the remote person want to see you paid USD10 for a burger meal and 
some beer then vomited on the pavement and had to pay a fine for that? 
Nope.  The remote person wants to see what the fuck you have put in your 
file that is screwing up the transaction stream.


--
Wm


Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-11 Thread Wm via gnucash-devel


On 03/02/2019 02:01, David Cousens wrote:


> As Geert pointed out whole of program testing is very difficult and rapidly
> reaches a situation where complexity is equal to or greater than the
> program complexity and this is really what gave rise to unit testing where
> you test individual components which do a specific function.


That can't fix a problem where an incorrect presumption was made in the 
first place.



> One area in which an example file rather than a test file might be useful
> is in developing the documentation. The guide section on Accounts
> Transaction following through to Personal Finances
> in essence constructs a simple file while doing the tutorial. Here though it
> is the process of constructing the data in the file that is useful. A
> completed example file is not of great use.


I'd advise against using any file as the right file for documentation 
purposes.  There are just too many edge cases.


Something I think would be amusing rather than instructive would be to 
put all of the example tx in the docs into one file.  I doubt it would 
be useful to anyone other than an historian of finance programs but it 
would be fun to see what we ended up with.  If someone is thinking of 
presenting a paper at a conference try it, mention me if you are feeling 
generous :)



> It is also likely that most problems which are likely to require this depth
> of investigation are unlikely to show up in a test file unless you can
> execute a series of entries in a scripted manner i.e. interact with the gui
> from a script and this is not possible with GnuCash at the moment AFAIK.
> The problem is usually somewhere in the process of getting to the results in
> the file and what is in the file is merely a symptom of the problem.


gnc is a transaction stream application.  each time you open a file it 
starts from 0 and does addition and subtraction.  no more no less.


on top of that we have pretty stuff, convenient ways of adding new 
transactions to the stream, convenient ways of reporting the results of 
the stream.


nevertheless, it is still just a program interpreting a stream of 
transactions.
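
The "stream of transactions" view can be illustrated with a small sketch. This is not GnuCash code: the `Split` tuple, the transaction ids and the account names are all hypothetical, chosen only to show balances being rebuilt from zero by addition and subtraction, plus the basic check that each transaction sums to zero.

```python
from collections import defaultdict
from fractions import Fraction
from typing import NamedTuple


class Split(NamedTuple):
    """One leg of a transaction (hypothetical, simplified structure)."""
    tx_id: str
    account: str
    value: Fraction  # positive = debit, negative = credit


def replay(splits):
    """Rebuild every account balance from zero by summing the split stream."""
    balances = defaultdict(Fraction)
    for split in splits:
        balances[split.account] += split.value
    return dict(balances)


def unbalanced_transactions(splits):
    """Return the ids of transactions whose splits do not sum to zero."""
    totals = defaultdict(Fraction)
    for split in splits:
        totals[split.tx_id] += split.value
    return sorted(tx for tx, total in totals.items() if total != 0)


# Illustrative data only.
stream = [
    Split("tx1", "Assets:Bank", Fraction(100)),
    Split("tx1", "Income:Salary", Fraction(-100)),
    Split("tx2", "Expenses:Food", Fraction(25)),
    Split("tx2", "Assets:Bank", Fraction(-25)),
]
balances = replay(stream)
```

A normalized file only has to keep `unbalanced_transactions` empty for this basic arithmetic to survive; whether that is enough for business data is the question debated elsewhere in the thread.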


gnc is a convenience.  I don't see why I should have to give live data 
to people I don't know in person ... and I don't even have super secret 
stuff like tax havens or a Donald Trump blow job account or a religious 
belief.


I just feel uncomfortable showing ordinary tx to people I don't know, it 
is that simple to me.


Q: Why does someone need to see *my* (or your) tx to fix a problem?
A: they don't

So, we are stuck.

--
Wm


Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-04 Thread Geert Janssens

On Saturday 2 February 2019 22:36:18 CET, Wm via gnucash-devel wrote:
> On 02/02/2019 15:24, Geert Janssens wrote:
> > As for Colin's question: on Windows and MacOS sqlite is supported out of
> > the box. On linux it may require the additional installation of a libdbi
> > driver. Most distros I know have packages for this driver but they may
> > not be installed by default.
> 
> It would be an odd distro that excluded SQLite, it is a requisite for a
> lot of other stuff like browsers.  Thinking aloud: maybe a server only
> install might not have it or someone stupid enough to put their data on
> Amazon might not have it available.  The question then becomes, why was
> the person so stupid?

Well I do understand sqlite is available by default, but gnucash requires 
libdbi with the sqlite backend (which in turn indeed uses sqlite). I haven't 
checked whether all supported distros also have that combination installed by 
default. I don't know if web browsers also use libdbi. I know Firefox does not.

And I haven't and won't spend time to check this for all those distros.

However I do agree this should only be a small hurdle. And I understand your 
script is an optional aid for those people that would want a better privacy 
guarantee before sending their data in for analysis.

Geert


Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-04 Thread Geert Janssens

On Saturday 2 February 2019 22:36:18 CET, Wm via gnucash-devel wrote:
> On 02/02/2019 15:24, Geert Janssens wrote:
> > Yes, if you use business features, you may have entered business
> > identifying data in File->Properties. I think that's what David is
> > referring to.
> I agree, the third party should not be identified.
> 
> > Similarly there may be customer and vendor data (names addresses) in the
> > book that should equally be obfuscated. Just random data is fine.
> 
> Yes.
> 
> Geert, at the moment I am putting guid in place of random, do you think
> that is a wrong way to approach this?
> 
I think GUIDs are probably fine as well.

Note I'm going by the theoretical goal of not being able to reconstruct the 
user's real financial data from the obfuscated file. Personally I'm not 
interested in doing that at all,  but people's paranoia levels may vary.

So talking of guids. If I remember correctly the default guids for accounts 
coming from gnucash account templates are hard-coded (or at least they used to 
be until somewhere in the 2.6 series).

So if that is still true then using guids in place of account names is only 
fake obfuscation. And perhaps these guids should be replaced throughout the 
book during the obfuscation before replacing account names with guids.
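
A minimal sketch of that ordering, assuming a deliberately simplified schema: the two tables below (`accounts` with `guid`, `name`, `parent_guid`, and `splits` with `guid`, `account_guid`) are illustrative stand-ins, not the full GnuCash SQL layout, and `remap_account_guids` is a hypothetical helper. Each account first gets a fresh GUID, every reference to the old one is rewritten, and only then is the name replaced.

```python
import sqlite3
import uuid


def remap_account_guids(conn):
    """Give every account a fresh GUID and rewrite all references to it.

    Assumes a simplified, illustrative schema (not the full GnuCash one):
    accounts(guid, name, parent_guid) and splits(guid, account_guid).
    """
    old_guids = [row[0] for row in conn.execute("SELECT guid FROM accounts")]
    mapping = {old: uuid.uuid4().hex for old in old_guids}
    for old, new in mapping.items():
        conn.execute("UPDATE accounts SET guid = ? WHERE guid = ?", (new, old))
        conn.execute(
            "UPDATE accounts SET parent_guid = ? WHERE parent_guid = ?", (new, old)
        )
        conn.execute(
            "UPDATE splits SET account_guid = ? WHERE account_guid = ?", (new, old)
        )
        # Only now replace the name, so it cannot leak a well-known template GUID.
        conn.execute("UPDATE accounts SET name = ? WHERE guid = ?", (new, new))
    conn.commit()
    return mapping


# Illustrative data only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (guid TEXT PRIMARY KEY, name TEXT, parent_guid TEXT);
    CREATE TABLE splits (guid TEXT PRIMARY KEY, account_guid TEXT);
    INSERT INTO accounts VALUES ('aaa', 'Assets', NULL), ('bbb', 'Bank', 'aaa');
    INSERT INTO splits VALUES ('s1', 'bbb');
""")
mapping = remap_account_guids(conn)
```

A real book has more places that refer to account GUIDs than these two columns, so this only shows the principle.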

> Actually, the nearer we get to complete random the less useful the file
> becomes.  Actual random data is harder than most people think and pretty
> much defeats the purpose if you think about it.
> 
From a human's point of view a guid is just random numbers. So I don't see how 
that makes a difference. If the same random value is used where the data was 
the same in the original book, it's just like using a guid. And I'm not talking 
of numbers for this part, I'm talking about customer names, vendor addresses, 
that kind of stuff.

> > Continuing on that vein, if you have bills and invoices, aside from
> > randomizing the transaction's split amounts and values you'll also have to
> > do the same for invoice entries.
> 
> I don't think that is true in most situations and even if what you say
> is true, I don't see it as a good argument against *attempting* a
> normalized book for most people.
> 
It's true if the bug to investigate is somewhere in the business code. In that 
case what your invoice data says should match what the resulting transactions 
say. Those are stored in different parts in the book, but are interrelated.

But even if the bug is not in business data, the business data should be 
properly anonymized or removed anyway such that the user can confidently share 
it without risking real financial or private info can be extracted from it. Of 
course in that context the business data no longer has to be consistent though 
I still believe it makes debugging harder if it isn't.

> > And to make the book useful for detecting
> > business data bugs this should happen in such a way that invoice tax and
> > discount amounts remain consistent after multiplying with random numbers
> > *and* that the invoice totals continue to match the business transactions
> > amounts in AR/AP accounts.
> 
> There will be situations that involve the person doing the triage
> needing to see actual transactions, I have already commented on that.
> 
Sure. However that's not what I'm implying here. The extra business 
requirements are an extension of your initial concept that transactions should 
continue to balance. From a business data point of view invoices with their 
entries should continue to balance with their invoice transactions or the data 
quickly becomes meaningless.
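
To make the consistency requirement concrete, here is a small sketch under stated assumptions: amounts are rounded to cents, and `scale_invoice` and `is_consistent` are hypothetical helpers, not part of GnuCash or of any existing script. It shows that scaling the invoice entries and scaling the posted total independently can disagree by a cent once rounding is involved, so the posted amount has to be recomputed from the rounded entries.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def scale_invoice(entries, factor):
    """Scale each entry amount by `factor`, rounding to cents.

    Returns the scaled entries and the total that the posted AR/AP
    transaction must carry, recomputed from the rounded entries so the
    two cannot drift apart.
    """
    scaled = [
        (amount * factor).quantize(CENT, rounding=ROUND_HALF_UP)
        for amount in entries
    ]
    return scaled, sum(scaled, Decimal("0"))


def is_consistent(entries, posted_total):
    """True when the invoice entries still add up to the posted amount."""
    return sum(entries, Decimal("0")) == posted_total


# Illustrative data only.
entries = [Decimal("10.05"), Decimal("10.05"), Decimal("10.05")]
factor = Decimal("1.5")
scaled, posted = scale_invoice(entries, factor)

# Scaling the original total on its own rounds differently.
naive_total = (sum(entries, Decimal("0")) * factor).quantize(
    CENT, rounding=ROUND_HALF_UP
)
```

Here `is_consistent(scaled, posted)` holds while `is_consistent(scaled, naive_total)` does not, which is the kind of drift that would make an obfuscated business book misleading to debug.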

> > And to make that one level more complicated, after that the payment
> > transactions *also* have to continue to match the new randomized invoice
> > amount (if the invoice was paid in full).
> 
> Umm, I don't think that is true.  If the munged numbers match (and
> they will, that is what the script will do) the transaction stream will
> be OK.
> 
> It is possible I have missed your point, Geert, but I think it is
> looking like I understand the contents of the gnc files better than you :(
> 
You did miss the point. You only think of balancing transactions. I'm also 
thinking of balancing lots, a more hidden aspect of the business data that's 
crucial to debug payment issues. My next reservation was also about consistent 
lots.

> > It doesn't end there, payments can be split over multiple invoices, so
> > again when one randomizes invoice amounts care must be taken to adjust
> > the payments in proportion to the invoice amount change or fully paid
> > invoices suddenly can become partially paid or overpaid.
> 
> Not true.
> 
> Geert, I don't want to say this but I believe you are actually wrong,
> for once.

It would be more useful to explain why you think that.
> 
> > While this is probably all possible I believe the resulting script will be
> > so complex that it will become a source of bugs in itself which would
> > divert developer time to debugging and maintaining this script

Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-03 Thread John Ralls

> On Feb 2, 2019, at 8:10 PM, David Carlson wrote:
> 
> OK, I want to try https://wiki.gnucash.org/wiki/ObfuscateScript but I am
> not a computer programmer.  I have no clue how to use it.  Can someone help
> me?

Run it from a command line using perl, assuming here that you have Strawberry 
installed on C:

  c:\strawberry\perl\bin\perl.exe ObfuscateScript path/to/myfile.gnucash

Note that it rewrites the file in place, so make a copy and run it on that. The 
file needs to be uncompressed.
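
For anyone who would rather script the "make a copy, uncompressed" step, a minimal Python sketch follows. It assumes the compressed book is a gzip file, which it detects from the two-byte gzip magic number; `make_working_copy` is a hypothetical helper name and is not part of GnuCash or of ObfuscateScript.

```python
import gzip
import shutil
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def make_working_copy(source, dest):
    """Copy `source` to `dest`, decompressing it if it is gzip-compressed.

    The original file is never modified, so a tool that rewrites its
    input in place can then be run safely on `dest`.
    """
    source, dest = Path(source), Path(dest)
    with source.open("rb") as f:
        compressed = f.read(2) == GZIP_MAGIC
    if compressed:
        with gzip.open(source, "rb") as src, dest.open("wb") as dst:
            shutil.copyfileobj(src, dst)
    else:
        shutil.copyfile(source, dest)
    return dest
```

Calling `make_working_copy("myfile.gnucash", "myfile-copy.gnucash")` and then running the perl command on `myfile-copy.gnucash` leaves the real book untouched.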

Regards,
John Ralls


Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-02 Thread David Carlson

OK, I want to try https://wiki.gnucash.org/wiki/ObfuscateScript but I am
not a computer programmer.  I have no clue how to use it.  Can someone help
me?

David C


Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-02 Thread David Cousens

Steve,

As Geert pointed out whole of program testing is very difficult and rapidly
reaches a situation where complexity is equal to or greater than  the
program complexity and this is really what gave rise to unit testing where
you test individual components which do a specific function.

One area in which an example file  rather than a test file might be useful
is in developing  the documentation. The guide section on Accounts
Transaction following through to Personal Finances
in essence constructs a simple file while doing the tutorial. Here though it
is  the process of constructing the data in the file that is useful. A
completed example file is not of great use. 

It is also likely that most problems which are likely to require this depth
of investigation are unlikely to show up in a test file unless you can
execute a series of entries in a scripted manner i.e. interact with the gui
from a script and this is not possible with GnuCash at the moment AFAIK. 
The problem is usually somewhere in the process of getting to the results in
the file and what is in the file is merely a symptom of the problem.

David




Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-02 Thread Frank H. Ellenberger

Hello Wm

On 01.02.19 at 14:36, Wm via gnucash-devel wrote:
> 
> My suggestion is we ask people to save a *copy* of their data in SQLite
> and they then run a script across that copy that munges and obfuscates
> 

Did you see https://wiki.gnucash.org/wiki/ObfuscateScript ?

It targets XML files and was uploaded in 2010, so it might be
slightly bit-rotted.

Regards
Frank


Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-02 Thread David Cousens

Wm,

>> It doesn't end there, payments can be split over multiple invoices, so
>> again when one randomizes invoice amounts care must be taken to adjust the
>> payments in proportion to the invoice amount change or fully paid invoices
>> suddenly can become partially paid or overpaid.
>
>Not true. 
>
>Geert, I don't want to say this but I believe you are actually wrong, 
>for once. 
>On 02/02/2019 15:24, Geert Janssens wrote:

In what way is what Geert says here not true? 

Payments can be split over multiple invoices. 
A single invoice could also have several payments associated with it.

These sorts of situations arise frequently in small businesses where you may
need to micro manage your cash flow.

If, in the randomisation process, you do not apply the same random factor to
all the invoices covered by that payment, then what he says is exactly what
will happen. This means your script will have to detect all of the invoices
related to a payment.  OK it can be dealt with,  but again the script
complexity is increased considerably to do so.
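
The detection step could be sketched as a connected-components problem. This is only an illustration under assumptions: payment and invoice ids are taken to be distinct strings, the `links` pairs and the range of the factor are invented, and `group_linked` and `assign_factors` are hypothetical helpers. Every invoice and payment that is linked, directly or through a shared payment, ends up in one group and receives the same factor.

```python
import random
from fractions import Fraction


def group_linked(links):
    """Group invoices and payments into connected components.

    `links` is an iterable of (payment_id, invoice_id) pairs. Everything
    reachable through shared payments or invoices ends up in one group.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for payment, invoice in links:
        parent[find(payment)] = find(invoice)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


def assign_factors(links, rng):
    """Pick one random scaling factor per group and map every id to it."""
    factors = {}
    for group in group_linked(links):
        factor = Fraction(rng.randint(50, 200), 100)
        for node in group:
            factors[node] = factor
    return factors


# Illustrative data only: pay1 covers two invoices, inv2 has two payments.
links = [("pay1", "inv1"), ("pay1", "inv2"), ("pay2", "inv2"), ("pay3", "inv3")]
factors = assign_factors(links, random.Random(0))
```

With these links, `pay1`, `pay2`, `inv1` and `inv2` share a factor, so an invoice that was paid in full stays paid in full after scaling; `pay3` and `inv3` form a separate group.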

>Most people don't use the business functions

I don't since I retired a few years ago, but I did for 8 years prior to
retiring (and I used MYOB for the 10 years prior to that before escaping). I
am certainly not alone. You could have a proviso that the script won't work
for files using the business functions but that then detracts considerably
from its usefulness as a general diagnostic tool.

Sqlite itself and its availability on Linux is not really an issue. Most
distros have it in their software repositories. What may be more of an issue
is that a lot of people who don't use the database backends because they
don't want the additional hassles of learning to use and maintain databases
may be reluctant to install it. It's not that it is all that difficult if
you're familiar with it, but if you are not, it is an additional hurdle
and learning curve. I'm retired. Taking an extra half day to learn something
new doesn't worry me as long as it happens before my time is up. But if I am
running a busy life and/or a business as I used to, I would be more
reluctant. Again not a show stopper, only a limitation on general
applicability.

David Cousens


Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-02 Thread Wm via gnucash-devel


On 02/02/2019 15:24, Geert Janssens wrote:


> Yes, if you use business features, you may have entered business identifying
> data in File->Properties. I think that's what David is referring to.


I agree, the third party should not be identified.


> Similarly there may be customer and vendor data (names addresses) in the book
> that should equally be obfuscated. Just random data is fine.


Yes.

Geert, at the moment I am putting guid in place of random, do you think 
that is a wrong way to approach this?


Actually, the nearer we get to complete random the less useful the file 
becomes.  Actual random data is harder than most people think and pretty 
much defeats the purpose if you think about it.



> Continuing on that vein, if you have bills and invoices, aside from
> randomizing the transaction's split amounts and values you'll also have to do
> the same for invoice entries.


I don't think that is true in most situations and even if what you say 
is true, I don't see it as a good argument against *attempting* a 
normalized book for most people.



> And to make the book useful for detecting
> business data bugs this should happen in such a way that invoice tax and
> discount amounts remain consistent after multiplying with random numbers *and*
> that the invoice totals continue to match the business transactions amounts in
> AR/AP accounts.


There will be situations that involve the person doing the triage 
needing to see actual transactions, I have already commented on that.



> And to make that one level more complicated, after that the payment
> transactions *also* have to continue to match the new randomized invoice
> amount (if the invoice was paid in full).


Umm, I don't think that is true.  If the munged numbers match (and 
they will, that is what the script will do) the transaction stream will 
be OK.
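
A tiny sketch of why matching munged numbers keep the stream balanced, assuming every split of a transaction is multiplied by the same exact factor. `scale_transaction` and the figures are hypothetical; exact fractions are used so no rounding enters the picture.

```python
from fractions import Fraction


def scale_transaction(split_values, factor):
    """Multiply every split value in one transaction by the same factor."""
    return [value * factor for value in split_values]


# Illustrative data only: one debit of 12.50 against credits of 10.00 and 2.50.
original = [Fraction(1250, 100), Fraction(-1000, 100), Fraction(-250, 100)]
munged = scale_transaction(original, Fraction(137, 100))
```

Because the factor distributes over the sum, `sum(munged)` is still zero whenever `sum(original)` was. This covers the transaction only; it says nothing about invoices, payments or lots that refer to the same amounts.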


It is possible I have missed your point, Geert, but I think it is 
looking like I understand the contents of the gnc files better than you :(



> It doesn't end there, payments can be split over multiple invoices, so again
> when one randomizes invoice amounts care must be taken to adjust the payments
> in proportion to the invoice amount change or fully paid invoices suddenly can
> become partially paid or overpaid.


Not true.

Geert, I don't want to say this but I believe you are actually wrong, 
for once.



> While this is probably all possible I believe the resulting script will be so
> complex that it will become a source of bugs in itself which would divert
> developer time to debugging and maintaining this script rather than working on
> the effectively reported bug for which a sample data file was asked in the
> first place...


Hmm, I accept your point and disagree.


> Up until a book with only transactions, no business data at all it sounded
> like a useful tool.


Be a brave man, Geert, most people don't use the business functions :)


> Oh and we haven't mentioned SXs and budgets yet...


Unless they are material to the file being investigated I suggest we 
just delete all SXs and budget stuff.



> As for Colin's question: on Windows and MacOS sqlite is supported out of the
> box. On linux it may require the additional installation of a libdbi driver.
> Most distros I know have packages for this driver but they may not be
> installed by default.


It would be an odd distro that excluded SQLite, it is a requisite for a 
lot of other stuff like browsers.  Thinking aloud: maybe a server only 
install might not have it or someone stupid enough to put their data on 
Amazon might not have it available.  The question then becomes, why was 
the person so stupid?


As far as I am concerned this conversation is ongoing, if only because 
Geert says he still needs a file from me to replicate a basic problem 
that I don't think needs any data from me at all.


--
Wm



Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-02 Thread Wm via gnucash-devel

On 02/02/2019 16:11, Geert Janssens wrote:

> But I don't know how feasible it is to effectively obfuscate that data without
> resorting to a complex script

The script will be seen by others that do understand sql before anyone 
innocent gets to use it, promise.

If the script is well documented (I don't see the point of obfuscated 
sql when we are doing something like this as time is not the major 
issue, getting the problem fixed is) then people that can read will use it.

Further, most of the actual gnc code is so fucking obfuscated it is 
acknowledged only a handful of people can read it, so do you really want 
to raise the issue of obfuscation, Geert?

Seriously, people that don't know how code works are already trusting 
their financial data to code they have no clue about.  Why is my 
suggestion going to increase or decrease trust or increase or decrease 
complexity?

Grr.

> that may introduce its own set of bugs

My script cannot introduce a bug, we are normalizing data <-- read that 
again, please.

> or inadvertently also obfuscate the actual issue.

That is a possibility.  I consider this a positive not a negative from a 
triage POV. the user says: "oops, my problem doesn't exist after I ran 
the normalizing script" <-- is this good or bad?  if the script is well 
documented the user can edit it and run it again, possibly solving the 
problem themselves.

> The latter is quickly tested, the former is a time waster.

This is a very good point and I repeat, this is not suggested as 
compulsory, this is intended to make things easier not harder for people 
that do want to report things that may be specific to them without 
exposing irrelevant details they may consider private or personal.

--
Wm


Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-02 Thread Wm via gnucash-devel


On 02/02/2019 15:40, David Carlson wrote:

> Wouldn't it be simpler to create a library of template files designed to
> exercise various features that a user could find one to illustrate his
> concern?


To some extent this is already done in the build process.  Life always 
throws up something unexpected.  Further, users are by definition lazy 
and want the devs to look at *their* data rather than being expected to 
trawl through a set of files containing data not relevant to their real 
life situation in the hope that one of them shows the fault that, by 
definition, shouldn't have existed in the first place.  See the circular 
bit?




> This would bypass the need to figure out how to sanitize every possible user
> file.


Sanitizing isn't that hard and we don't actually need perfection, just 
sufficient so that people are confident that the devs aren't snooping on 
them.



> If the user wants, he could still build his own example file as some users
> do now.


The problem is that some people build files that don't work for 
everyone; it does say "normalizing" in the Subject line, none of this is 
ever going to be compulsory.


--
Wm


Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-02 Thread Wm via gnucash-devel


On 02/02/2019 09:59, Colin Law wrote:

> Can all users save files as sqlite?  Does that need anything extra
> installed on the OS side that may not be there?  Also what about
> different builds of GC, do they all have sqlite?


I'm fairly sure all of the official builds can save SQLite.  If someone 
is rolling their own on a platform without the sqlite libraries then I 
think it would be unusual for them not to also have access to gnc on one 
of the production platforms, the whole idea being that the data should 
be easily transferable.


Even if someone didn't have SQLite my suggestion isn't taking something 
away from them.  If someone can't save an SQLite file and run a 
script, the existing options are still there.


--
Wm





Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-02 Thread Geert Janssens

On Saturday 2 February 2019 16:40:34 CET, David Carlson wrote:
> Wouldn't it be simpler to create a library of template files designed to
> exercise various features that a user could find one to illustrate his
> concern?
> 
> This would bypass the need to figure out how to sanitize every possible user
> file.
> 
> If the user wants, he could still build his own example file as some users
> do now.

Both approaches have benefits and drawbacks.

The number of possible ways something can go wrong in gnucash is near 
infinite. Sometimes the problems only appear purely due to the amount of data, 
sometimes it comes from migration issues (migration from older gnucash 
versions,...). It would be equally hard to come up with a set of template files 
that would cover all of those.
From that point of view the idea to be able to look at the user's own data 
file is attractive as that is known to illustrate the problem.

But I don't know how feasible it is to effectively obfuscate that data without 
resorting to a complex script that may introduce its own set of bugs or 
inadvertently also obfuscate the actual issue. The latter is quickly tested, 
the former is a time waster.

Geert



Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-02 Thread David Carlson

On Sat, Feb 2, 2019, 9:25 AM Geert Janssens wrote:

> On Saturday 2 February 2019 10:19:02 CET, Wm via gnucash-devel wrote:
> > On 02/02/2019 00:16, David Cousens wrote:
> > > As well as the account names you might also want to munge data in the
> > > description/memo fields. This can contain identifying information for
> > > customers/vendors.
> >
> > How about we just zap the stuff in description/memo fields by default?
> > They're not mathematically significant and rarely cause double entry
> > problems unless someone introduces unusual UI stuff in which case they
> > should be able to provide an example.
> >
> > > Also possible any data relating to the owner of the file
> > > which is stored in the file/database.
> >
> > Does your file/database have an obvious owner?  Mine doesn't apart from
> > the name of the file which is the first and obvious thing to change
> > before you send it off for someone else to look at.
> >
> > If you mean bits of text in reports they wouldn't be included in an
> > SQLite file.
> >
> > If you mean bits of text in outbound documents I think we've already
> > zapped them.
> >
> > Have I missed your point?
> >
>
> Yes, if you use business features, you may have entered business
> identifying data in File->Properties. I think that's what David is
> referring to. Similarly there may be customer and vendor data (names,
> addresses) in the book that should equally be obfuscated. Just random
> data is fine.
>
> Continuing in that vein, if you have bills and invoices, aside from
> randomizing the transaction's split amounts and values you'll also have
> to do the same for invoice entries. And to make the book useful for
> detecting business data bugs this should happen in such a way that
> invoice tax and discount amounts remain consistent after multiplying
> with random numbers *and* that the invoice totals continue to match the
> business transaction amounts in AR/AP accounts.
>
> And to make that one level more complicated, after that the payment
> transactions *also* have to continue to match the new randomized invoice
> amount (if the invoice was paid in full).
>
> It doesn't end there, payments can be split over multiple invoices, so
> again when one randomizes invoice amounts care must be taken to adjust
> the payments in proportion to the invoice amount change or fully paid
> invoices suddenly can become partially paid or overpaid.
>
> While this is probably all possible I believe the resulting script will
> be so complex that it will become a source of bugs in itself which would
> divert developer time to debugging and maintaining this script rather
> than working on the bug that was actually reported, for which a sample
> data file was asked in the first place...
>
> As long as it concerned a book with only transactions and no business
> data at all, it sounded like a useful tool.
>
> Oh and we haven't mentioned SXs and budgets yet...
>
> As for Colin's question: on Windows and MacOS sqlite is supported out of
> the box. On Linux it may require the additional installation of a libdbi
> driver. Most distros I know have packages for this driver but they may
> not be installed by default.
>
> Geert


Wouldn't it be simpler to create a library of template files designed to
exercise various features, from which a user could pick one to illustrate
his concern?

This would bypass the need to figure out how to sanitize every possible user
file.

If the user wants, he could still build his own example file as some users
do now.

David Carlson


Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-02 Thread Geert Janssens

Op zaterdag 2 februari 2019 10:19:02 CET schreef Wm via gnucash-devel:
> On 02/02/2019 00:16, David Cousens wrote:
> > As well as the account names you might also want to munge data in the
> > description/memo fields. This can contain identifying information for
> > customers/vendors.
> 
> How about we just zap the stuff in description/memo fields by default?
> They're not mathematically significant and rarely cause double entry
> problems unless someone introduces unusual UI stuff in which case they
> should be able to provide an example.
> 
> > Also possible any data relating to the owner of the file
> > which is stored in the file/database.
> 
> Does your file/database have an obvious owner?  Mine doesn't apart from
> the name of the file which is the first and obvious thing to change
> before you send it off for someone else to look at.
> 
> If you mean bits of text in reports they wouldn't be included in an
> SQLite file.
> 
> If you mean bits of text in outbound documents I think we've already
> zapped them.
> 
> Have I missed your point?
> 

Yes, if you use business features, you may have entered business identifying 
data in File->Properties. I think that's what David is referring to. 
Similarly there may be customer and vendor data (names, addresses) in the book 
that should equally be obfuscated. Just random data is fine.

Continuing in that vein, if you have bills and invoices, aside from 
randomizing the transaction's split amounts and values you'll also have to do 
the same for invoice entries. And to make the book useful for detecting 
business data bugs this should happen in such a way that invoice tax and 
discount amounts remain consistent after multiplying with random numbers *and* 
that the invoice totals continue to match the business transaction amounts in 
AR/AP accounts.

And to make that one level more complicated, after that the payment 
transactions *also* have to continue to match the new randomized invoice 
amount (if the invoice was paid in full).

It doesn't end there, payments can be split over multiple invoices, so again 
when one randomizes invoice amounts care must be taken to adjust the payments 
in proportion to the invoice amount change or fully paid invoices suddenly can 
become partially paid or overpaid.
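
To illustrate the constraint described above, here is a minimal sketch in
Python. It is not GnuCash code: the invoice ids, the amounts and the helper
names are all hypothetical, and it ignores tax tables, discounts and rounding
to the currency's smallest unit. It only shows that scaling invoices and
payment allocations by one shared factor keeps a fully paid invoice fully
paid, which independent random factors would not.

```python
from fractions import Fraction

def scale_book(invoices, payments, factor):
    """Multiply every invoice total and every payment allocation by one shared factor."""
    scaled_invoices = {inv: total * factor for inv, total in invoices.items()}
    scaled_payments = [(inv, amount * factor) for inv, amount in payments]
    return scaled_invoices, scaled_payments

def outstanding(invoices, payments):
    """Return the unpaid remainder per invoice."""
    remaining = dict(invoices)
    for inv, amount in payments:
        remaining[inv] -= amount
    return remaining

# Hypothetical data: one payment of 150 split over two invoices, both paid in full.
invoices = {"INV-1": Fraction(100), "INV-2": Fraction(50)}
payments = [("INV-1", Fraction(100)), ("INV-2", Fraction(50))]

factor = Fraction(137, 100)  # one obfuscation factor for the whole book (assumed)
new_invoices, new_payments = scale_book(invoices, payments, factor)
print(outstanding(new_invoices, new_payments))
```

Exact fractions are used here so that the remainders can be compared with
zero; a real script would also have to decide how to round scaled amounts
without reintroducing small imbalances.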

While this is probably all possible I believe the resulting script will be so 
complex that it will become a source of bugs in itself which would divert 
developer time to debugging and maintaining this script rather than working on 
the bug that was actually reported, for which a sample data file was asked in 
the first place...

As long as it concerned a book with only transactions and no business data at 
all, it sounded like a useful tool.

Oh and we haven't mentioned SXs and budgets yet...

As for Colin's question: on Windows and MacOS sqlite is supported out of the 
box. On Linux it may require the additional installation of a libdbi driver. 
Most distros I know have packages for this driver but they may not be 
installed by default.

Geert


Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-02 Thread Colin Law

Can all users save files as sqlite?  Does that need anything extra
installed on the OS side that may not be there?  Also what about
different builds of GC, do they all have sqlite?

Colin

Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-02 Thread Wm via gnucash-devel


On 02/02/2019 00:16, David Cousens wrote:


> As well as the account names you might also want to munge data in the
> description/memo fields. This can contain identifying information for
> customers/vendors.


How about we just zap the stuff in description/memo fields by default? 
They're not mathematically significant and rarely cause double entry 
problems unless someone introduces unusual UI stuff in which case they 
should be able to provide an example.
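
A minimal sketch of what "zapping" the free-text fields could look like on a
copy of the data, using Python's sqlite3 module. The table and column names
below are a stand-in created in memory for the example; they are assumed to
resemble the GnuCash SQL layout and should be checked against a real file
before anything like this is run.

```python
import sqlite3

def zap_free_text(conn):
    """Blank the free-text fields; amounts and guids are left untouched."""
    conn.execute("UPDATE transactions SET description = ''")
    conn.execute("UPDATE splits SET memo = ''")
    conn.commit()

# Stand-in schema and data, for illustration only (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (guid TEXT PRIMARY KEY, description TEXT);
    CREATE TABLE splits (guid TEXT PRIMARY KEY, tx_guid TEXT, memo TEXT,
                         value_num INTEGER, value_denom INTEGER);
    INSERT INTO transactions VALUES ('t1', 'Invoice 42 for ACME Ltd');
    INSERT INTO splits VALUES ('s1', 't1', 'ACME deposit', 5000, 100);
    INSERT INTO splits VALUES ('s2', 't1', '', -5000, 100);
""")

zap_free_text(conn)
descriptions = conn.execute("SELECT description FROM transactions").fetchall()
memos = conn.execute("SELECT memo FROM splits ORDER BY guid").fetchall()
print(descriptions, memos)
```

Because only text columns are touched, the split values still sum to zero
afterwards.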



> Also possible any data relating to the owner of the file
> which is stored in the file/database.


Does your file/database have an obvious owner?  Mine doesn't apart from 
the name of the file which is the first and obvious thing to change 
before you send it off for someone else to look at.


If you mean bits of text in reports they wouldn't be included in an 
SQLite file.


If you mean bits of text in outbound documents I think we've already 
zapped them.


Have I missed your point?

Always possible, don't be put off by my rough and tumble impression of 
the idiot Trump, I do actually care.



> The combination of the above would
> probably be considered commercially sensitive information and at a personal
> level what banks/service companies etc you deal with might be a possible
> problem if it is in the public domain.


Ummm, that isn't really our problem, David.  If you subscribe to the 
"I'm an American and the government supports me" foolishness I'm 
wondering why the fuck any of you voted for the imbecile in charge at 
the moment!


Any banking account details have already been removed.

Next?

--
Wm






Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-01 Thread David Cousens

Wm

As well as the account names you might also want to munge data in the
description/memo fields. This can contain identifying information for
customers/vendors. Also possibly any data relating to the owner of the file
that is stored in the file/database. The combination of the above would
probably be considered commercially sensitive information and at a personal
level what banks/service companies etc you deal with might be a possible
problem if it is in the public domain.

David Cousens





Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-01 Thread Wm via gnucash-devel


On 01/02/2019 13:36, Wm via gnucash-devel wrote:

would someone other than idiot Stephen M Butler attempt a reply please

TIA


Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-01 Thread Wm via gnucash-devel


On 01/02/2019 19:17, Stephen M. Butler wrote:

Ummm, Stephen M. Butler I don't think you were my intended audience.

Let me put you down gently.


> It might be better to have a standardized test file that folks could
> download, and run their scenario against.


Nope, we can do that already, I was addressing other realistic situations.


> However, there are situations that arise where the only solution is to
> look at the original file.  In that case some obfuscation would be
> helpful.  I would think that memos and descriptions would also need to
> be randomized.


My suggestion is they are zapped, no personal stuff at all


> After a careful read, I realized you did intend to
> randomize the transaction amounts (which would have to be done carefully
> to ensure the DR/CR remained balanced).


I'm one of the more intelligent people here, the tx will remain balanced.
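
A minimal sketch of why a single shared multiplier keeps every transaction
balanced, again using Python's sqlite3 module on an in-memory stand-in table.
The `splits` layout, the guids and the amounts are assumptions made for the
example, and all splits are assumed to share one denominator. An integer
factor is used only because it makes the balance argument exact; it is not a
strong obfuscation, and a real script would need a better factor and a
rounding policy.

```python
import random
import sqlite3

def scale_amounts(conn, factor):
    """Multiply every split numerator by the same integer factor."""
    conn.execute("UPDATE splits SET value_num = value_num * ?", (factor,))
    conn.commit()

def unbalanced(conn):
    """List tx guids whose split numerators no longer sum to zero."""
    rows = conn.execute(
        "SELECT tx_guid FROM splits GROUP BY tx_guid HAVING SUM(value_num) <> 0"
    ).fetchall()
    return [row[0] for row in rows]

# Stand-in data: two balanced transactions (names and values are hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE splits (guid TEXT PRIMARY KEY, tx_guid TEXT,
                         value_num INTEGER, value_denom INTEGER);
    INSERT INTO splits VALUES ('s1', 't1',  12345, 100);
    INSERT INTO splits VALUES ('s2', 't1', -10000, 100);
    INSERT INTO splits VALUES ('s3', 't1',  -2345, 100);
    INSERT INTO splits VALUES ('s4', 't2',    999, 100);
    INSERT INTO splits VALUES ('s5', 't2',   -999, 100);
""")

total_before = conn.execute("SELECT SUM(ABS(value_num)) FROM splits").fetchone()[0]
factor = random.randint(2, 97)  # one factor for the whole file
scale_amounts(conn, factor)
total_after = conn.execute("SELECT SUM(ABS(value_num)) FROM splits").fetchone()[0]
print(unbalanced(conn))
```

Since each transaction's splits sum to zero before scaling, multiplying all of
them by the same number leaves the sum at zero.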


> Otherwise, one could at least get
> the total Assets/Liabilities/Income/Expense values known for the
> submitter.  That may be sensitive information.  I know that I've shared
> some information where later reflection was "did I really give them that!"


Um


> Now, to the XML vs SQLite argument.  Whatever script is applied to one
> could easily have a counterpart that would apply to the other.  You
> wouldn't have to manually (informally) edit the XML.  A known script
> should provide a known outcome.


Not true in reverse if someone throws in some numbers no other person 
knows about.  Think about diminishing returns.


I can't correct this fucked up quote below, must be a Mexican border 
issue, sigh.  Looks like a Trump voter, fucked quotient in general.


>I suspect that many folks are using an

XML back-end and would rather not fiddle with a database back-end.


We know that, we ask for a specific db when we need to test stuff.

I've given up correcting the quoting, sorry, folks.

 I'm

in that camp even though I'm a trained Oracle DBA and spent a couple
decades using that back-end professionally.


We are unimpressed unless you contribute.

Some of us also think training may have been wasted time if you end up 
not knowing much about databases.



> I think the first step is having a standard test file that a user could
> apply to their favorite back-end, run their scenario, check the
> results.


Wrong, please read what I said before.  G.

I hate it when someone so obviously doesn't read.


> If the problem is verified, then we have pretty good evidence
> the problem is in the application.  If the problem doesn't show up, then
> it indicates the problem may be in the data.  That would require a "data
> forensic expert" (aka developer or some assistant) to look deeper into
> the user's data file.  In that case a good obfuscation tool would come
> in handy.


I'd say something obviously rude around now but Liz would zap me instead 
of the fool if past rules are anything to go by :(


I'd like someone with a clue to attempt an answer.

--
Wm



Re: [GNC-dev] Normalizing live data, a suggestion for discussion

2019-02-01 Thread Stephen M. Butler

On 2/1/19 5:36 AM, Wm via gnucash-devel wrote:
> Situation: someone reports a problem with gnc, at triage it becomes
> clear some data is going to be required to identify or solve the
> problem. Normal question?  Can you give us a file.
>
> Problem: for any number of reasons ranging from plain old personal
> privacy through to people that live in supposed liberal societies
> avoiding tax and people in supposed conservative societies avoiding
> persecution, sending live data isn't always appropriate.  The USA has
> become very weird about this and most of our development people are in
> the USA so hopefully they'll understand the politics of privacy,
> eventually.
>
> Suggestion: we try to make providing a file easier for people.
>
> My suggestion is we ask people to save a *copy* of their data in
> SQLite and they then run a script across that copy that munges and
> obfuscates
>
> 1. account names [1]
>
> 2. numbers [2]
>
> [1] people following this will probably be aware that gnc doesn't know
> about account names much beyond broad classes in spite of providing
> lots of names and not accommodating other accounting concepts such as
> the fact there is a level one up [3].  My point here is that account
> names are important to people but not gnc so why not just randomize
> them? Obvious way? Copy the actual account name (the guid) to the user
> visible one.  This is a one way change unless someone has unusual
> settings on their SQLite file, if someone has those settings it seems
> reasonable to presume they also know how to turn them off and save the
> file again.
>
> [2] as long as the transaction stream balances the actual numbers
> don't matter (there will be occasions where the numbers are important
> but these tend to be number extremes related to commodities rather
> than anyone using gnc to do a Mr Putin vs Mr Trump sports bet).  In
> most cases multiplying any matching numbers by the same semi-random
> factor should produce a good file for examination so long as it is done
> consistently [4]
>
> [3] that is a long argument I am interested in conceptually rather
> than personally, it doesn't affect me as a UK person but makes me
> think Internationally.
>
> [4] I don't think a reductive discussion of true vs near true random
> [5] is appropriate, the significant point is the person viewing the
> data won't be able to work out the original number without significant
> effort and in most cases simply won't be able to work it out at all,
> we're talking computing assets I doubt anyone here has access to in
> order to get back *and* I believe the gnc people are actually
> motivated by solving problems, belief in the project and ordinary
> stuff like that so they won't even be looking.
>
> [5] Random is fun if only because there are so many ways of doing it.
>
> Questions: why SQLite rather than XML?  Because if a person runs an
> agreed script across their file we can be sure of an outcome.  Editing
> an XML file informally is scary, it immediately raises questions about
> consistency of data. Other SQL formats are not widely used, my
> proposal is we go for LCD where we can achieve normalization.
>
> Normalization will have to be balanced: privacy vs contribution to the
> project.
>
> I definitely want contribution from other people that work well with
> SQL, let's think about this together, people, I have written some
> scripts that confuse *my* data and I know that Geert is still waiting
> for me to send him a file.
>
> Geert is a good person, I just don't want to show him very personal
> stuff in my file.
>
> I have a plan for making showing a file easier, is anyone interested?
>
> This is the *start* of a conversation, I welcome thoughts. 
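
The account-name step of the proposal quoted above (copy the guid over the
user-visible name) can be sketched as follows. This is only an illustration
under stated assumptions: the `accounts` table here is a simplified stand-in
built in memory, its column names are assumed rather than taken from a real
GnuCash file, and the sample rows are hypothetical.

```python
import sqlite3

def anonymize_account_names(conn):
    """Overwrite each user-visible account name with the account's own guid."""
    conn.execute("UPDATE accounts SET name = guid")
    conn.commit()

# Stand-in schema and data, for illustration only (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (guid TEXT PRIMARY KEY, name TEXT,
                           account_type TEXT, parent_guid TEXT);
    INSERT INTO accounts VALUES ('a1', 'Assets', 'ASSET', NULL);
    INSERT INTO accounts VALUES ('a2', 'Savings at Big Bank', 'BANK', 'a1');
""")

anonymize_account_names(conn)
accounts = conn.execute(
    "SELECT guid, name, account_type, parent_guid FROM accounts ORDER BY guid"
).fetchall()
print(accounts)
```

The account type and the parent link are left alone, so the shape of the
account tree survives while the names people chose do not.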


It might be better to have a standardized test file that folks could
download, and run their scenario against. 

However, there are situations that arise where the only solution is to
look at the original file.  In that case some obfuscation would be
helpful.  I would think that memos and descriptions would also need to
be randomized.  After a careful read, I realized you did intend to
randomize the transaction amounts (which would have to be done carefully
to ensure the DR/CR remained balanced).  Otherwise, one could at least get
the total Assets/Liabilities/Income/Expense values known for the
submitter.  That may be sensitive information.  I know that I've shared
some information where later reflection was "did I really give them that!"

Now, to the XML vs SQLite argument.  Whatever script is applied to one
could easily have a counterpart that would apply to the other.  You
wouldn't have to manually (informally) edit the XML.  A known script
should provide a known outcome.  I suspect that many folks are using an
XML back-end and would rather not fiddle with a database back-end.  I'm
in that camp even though I'm a trained Oracle DBA and spent a couple
decades using that back-end professionally.

I think the first step is having a standard test file that a user could
apply
