[Gretl-users] Re: Bug with string-valued series when storing data

Allin Cottrell Sat, 18 Jul 2020 14:34:04 -0700

On Sat, 18 Jul 2020, Artur Tarassow wrote:

On 17.07.20 at 16:32, Allin Cottrell wrote:
On Fri, 17 Jul 2020, Artur Tarassow wrote:
On 16.07.20 at 14:53, Allin Cottrell wrote:
On Wed, 15 Jul 2020, Allin Cottrell wrote:
On Wed, 15 Jul 2020, Artur Tarassow wrote:
But what about the case when adding the "--permanent" flag?
I can see a case for shrinking the strings array when the --permanent option is given, though it's not totally clear-cut.
Here's a follow-up. You could think of this as a prototype of what we might do internally with a string-valued series on permanent sub-sampling.
Sorry for the late reply, Allin. Yes that looks good to me.
But would "permanent sub-sampling" mean that this only applies when executing the smpl command with the --permanent flag? Or would it also apply when storing a sub-sampled data set?
I favour doing this only when the --permanent option is given. It's an information-destroying move, and I can imagine cases where one wants to save a sub-sample and yet not lose the information in question. But if you _want_ to lose it, without using --permanent, then just store in a format other than gdt or gdtb.
I understand your point, Allin. And getting it to work when using the --permanent option would be very useful.
But let me think out loud about some of my use cases -- maybe some others have to deal with similar ones...

[cases where using the --permanent option with "smpl" would clearly not be convenient]

OK, here's what's now in git (not yet in snapshots, I'd prefer to see some testing first):

(1) Imposing a sample restriction with the --permanent option results in "trimming" of string-valued series: only string values that appear within the sub-sample are preserved, and the numeric coding for such series is adjusted accordingly. Note, this means that any given observation will have the same string value as it had in the full dataset, but may not have the same numeric code.
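
To make the recoding in (1) concrete, here is a minimal Python sketch. It is not gretl's internal code: the function name trim_strvals, the 1-based codes, the use of None for a missing value, and the choice to keep surviving strings in their original order are all assumptions made for illustration.

```python
def trim_strvals(codes, strings):
    """Drop strings not referenced by any code and renumber the codes.

    Assumptions (illustrative only): codes are 1-based indices into
    `strings`, and None marks a missing observation.
    """
    used = {c for c in codes if c is not None}
    # Old codes that survive, kept in their original order.
    kept = [k for k in range(1, len(strings) + 1) if k in used]
    # Map each surviving old code to its new, consecutive code.
    remap = {old: new for new, old in enumerate(kept, start=1)}
    new_strings = [strings[old - 1] for old in kept]
    new_codes = [None if c is None else remap[c] for c in codes]
    return new_codes, new_strings


strings = ["alpha", "beta", "gamma"]
full_codes = [1, 2, 3, 2, 3]
sub_codes = full_codes[2:]  # pretend the sub-sample is observations 3 to 5
new_codes, new_strings = trim_strvals(sub_codes, strings)

# String values per observation, before and after trimming.
before = [strings[c - 1] for c in sub_codes]
after = [new_strings[c - 1] for c in new_codes]
```

In this toy case "alpha" drops out of the strings array, so "gamma" moves from code 3 to code 2 while every observation keeps its string value.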

(2) When using the "store" command with a native target (gdt or gdtb) there's a new option --trim-strvals which has a similar effect. We achieve this as follows:


* Any string-valued series are first backed up (copied in RAM).

* Before we actually write the data file we "trim" as described above.

* Once the write is finished we restore the full form of the string-valued series.

So you can sub-sample, store the data in trimmed form, then restore the full dataset without loss of information -- or at least that's the idea! This has worked OK in my limited testing today, but more testing is wanted.
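
The back up, trim, write, restore sequence can be sketched in Python as follows. This is only an illustration of the pattern, not what gretl does in C: the dict layout of a series, the names trim and store_trimmed, and the write callback are all hypothetical.

```python
import copy


def trim(codes, strings):
    # Keep only strings referenced by some code; renumber codes from 1.
    used = set(codes)
    kept = [k for k in range(1, len(strings) + 1) if k in used]
    remap = {old: new for new, old in enumerate(kept, start=1)}
    return [remap[c] for c in codes], [strings[k - 1] for k in kept]


def store_trimmed(series, write):
    """Write `series` in trimmed form, leaving it unchanged afterwards.

    `series` is a dict with "codes" and "strings" (hypothetical layout);
    `write` is any callable that persists the series.
    """
    backup = copy.deepcopy(series)       # step 1: back up in RAM
    try:
        series["codes"], series["strings"] = trim(
            series["codes"], series["strings"]
        )                                # step 2: trim before writing
        write(series)
    finally:
        series.clear()
        series.update(backup)            # step 3: restore the full form


series = {"codes": [3, 2, 3], "strings": ["alpha", "beta", "gamma"]}
written = []  # stands in for the data file
store_trimmed(series, lambda s: written.append(copy.deepcopy(s)))
```

The restore sits in a finally clause so that the full strings array comes back even if the write fails; whether gretl guards against that case is not stated above.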

One further remark: "store --trim-strvals" will work even when there's no sub-sample in place, in case the dataset contains any redundant string values. I hadn't noticed before, but gretl's grunfeld.gdt contains a redundant 11th firmname, "American Steel" (there are only 10 firms in our dataset). You can remove that by using store with the new option.
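
For what "redundant" means here, a small Python sketch: a string value is redundant if no observation's numeric code points at it. The function name and the toy data are made up for illustration and are not taken from grunfeld.gdt.

```python
def redundant_strvals(codes, strings):
    # Strings whose 1-based position is never used as a code.
    used = set(codes)
    return [s for k, s in enumerate(strings, start=1) if k not in used]


strings = ["Firm A", "Firm B", "Firm C"]  # toy strings array
codes = [1, 1, 2, 2]                      # no observation uses code 3
unused = redundant_strvals(codes, strings)
```

Here "Firm C" is listed in the strings array but never referenced, which is the kind of entry that trimming would remove even with the full sample in place.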


Allin
_______________________________________________
Gretl-users mailing list -- gretl-users@gretlml.univpm.it
To unsubscribe send an email to gretl-users-le...@gretlml.univpm.it
Website: 
https://gretlml.univpm.it/postorius/lists/gretl-users.gretlml.univpm.it/
