On Mon, 18 Jun 2012, Allin Cottrell wrote:

> A few changes in recent gretl CVS address the issue of
> handling very large datasets -- datasets that will not fit
> into RAM in their entirety. [...]

First of all, let me thank Allin for all the work he's done in this 
direction: I ran a few tests on the new CVS features and I can confirm 
that every test I tried works splendidly.

That said, I have the feeling that, in order to use the datasets Allin is 
referring to effectively, we need an extra ingredient (which, IMHO, is THE 
feature that made Stata the killer package in some quarters of the 
econometrics profession): the ability to extract data sensibly by 
performing those operations that, in database parlance, are called JOINs.

Let me provide an example for the unfamiliar: suppose you have two 
datafiles, one for individuals and one for households (example attached). 
Of course, the two datafiles don't match by row, since there may be more 
than one individual in a household; however, you may want to attach data 
from one dataset to the other one.

The following script should clarify what I have in mind:

<hansl>
# example 1 (easy): assigning a household variable to an individual
open hh_data.csv --quiet
matrix X = {HH_NO, AREA}

open ind_data.csv --quiet --preserve
series AREA = HH_NO
AREA = replace(AREA, X[,1], X[,2])

print -o

# example 2 (harder): going in the opposite direction
open ind_data.csv --quiet
matrix v = values(HH_NO)
matrix X = v ~ 0

loop i = 1..rows(v) --quiet
     smpl HH_NO==v[i] --restrict --replace
     X[i,2] = mean(AGE)
endloop

open hh_data.csv --quiet --preserve
series AVGAGE = X[,2]

print -o
</hansl>

In the first example, we attach to the individuals dataset the area where 
the household lives; in the second, we attach to the households dataset 
the average age of its members. I'm sure you'll have no problem imagining 
all the possible generalisations (and caveats: for example, NAs and 
missing rows have to be handled).
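For readers more at home outside hansl, here is a minimal Python sketch of 
the same two joins, using the attached example data (the variable names, 
such as `hh_area` and `individuals`, are mine, purely for illustration):

```python
from collections import defaultdict

# hh_data: household number -> area
hh_area = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3}

# ind_data: (household number, age) per individual
individuals = [
    (1, 44), (1, 42), (1, 12), (1, 7),
    (2, 22), (2, 22),
    (3, 56),
    (4, 32), (4, 11), (4, 9),
    (5, 76), (5, 68),
    (6, 55),
    (7, 43),
    (8, 69), (8, 45),
]

# Example 1: attach the household-level AREA to each individual
# (one value per individual row, looked up by household number)
ind_area = [hh_area[hh] for hh, _ in individuals]

# Example 2: attach the average AGE of its members to each household
# (group the individuals by household, then average within each group)
ages = defaultdict(list)
for hh, age in individuals:
    ages[hh].append(age)
avg_age = {hh: sum(a) / len(a) for hh, a in ages.items()}
```

Example 1 is a one-to-many lookup (each individual inherits one household 
value); example 2 is the reverse, an aggregation from the many side back 
to the one side.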

For very large bodies of data, this is typically a job for SQL-based 
DBMSes, such as Oracle, Postgres, MySQL, etcetera, which perform this kind 
of job with amazing speed and accuracy through astonishingly clever 
algorithms and tricks. However, you need to have (1) the data in that 
format already and (2) a decent knowledge of SQL, which very few applied 
economists have.
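For concreteness, here is how the same two operations read in SQL, run 
through Python's built-in sqlite3 module (the table and column names 
mirror the attached files; this is an illustration of the DBMS approach, 
not a proposal for gretl's implementation):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hh  (HH_NO INTEGER, AREA INTEGER);
CREATE TABLE ind (HH_NO INTEGER, IND_NO INTEGER, GENDER INTEGER, AGE INTEGER);
""")
con.executemany("INSERT INTO hh VALUES (?, ?)",
                [(1, 1), (2, 1), (3, 2), (4, 2),
                 (5, 2), (6, 3), (7, 3), (8, 3)])
con.executemany("INSERT INTO ind VALUES (?, ?, ?, ?)",
                [(1, 1, 1, 44), (1, 2, 0, 42), (1, 3, 1, 12), (1, 4, 0, 7),
                 (2, 1, 1, 22), (2, 2, 0, 22), (3, 3, 0, 56),
                 (4, 1, 0, 32), (4, 2, 1, 11), (4, 3, 1, 9),
                 (5, 1, 1, 76), (5, 2, 0, 68), (6, 1, 0, 55),
                 (7, 1, 0, 43), (8, 1, 1, 69), (8, 2, 1, 45)])

# Example 1: household AREA attached to each individual (one-to-many JOIN)
rows1 = con.execute("""
    SELECT ind.HH_NO, ind.IND_NO, hh.AREA
    FROM ind JOIN hh USING (HH_NO)
""").fetchall()

# Example 2: average member AGE attached to each household (JOIN + GROUP BY)
rows2 = con.execute("""
    SELECT hh.HH_NO, AVG(ind.AGE) AS AVGAGE
    FROM hh JOIN ind USING (HH_NO)
    GROUP BY hh.HH_NO
""").fetchall()
```

Note that the whole second example collapses into a single JOIN + GROUP BY 
query, with the grouping loop handled entirely inside the engine.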

It'd be extremely nice to be able to perform this kind of operation from 
a gretl database: in the example above, I would imagine that you could 
have a .bin file containing all the series in the two datasets, with some 
mechanism in place for identifying which series is a household-level 
series, which one is an individual-level series, and so forth.

--------------------------------------------------
  Riccardo (Jack) Lucchetti
  Dipartimento di Economia

  Università Politecnica delle Marche
  (formerly known as Università di Ancona)

  r.lucchetti(a)univpm.it
  http://www2.econ.univpm.it/servizi/hpp/lucchetti
--------------------------------------------------
hh_data.csv:

HH_NO	AREA
1	1
2	1
3	2
4	2
5	2
6	3
7	3
8	3

ind_data.csv:

HH_NO	IND_NO	GENDER	AGE
1	1	1	44
1	2	0	42
1	3	1	12
1	4	0	7
2	1	1	22
2	2	0	22
3	3	0	56
4	1	0	32
4	2	1	11
4	3	1	9
5	1	1	76
5	2	0	68
6	1	0	55
7	1	0	43
8	1	1	69
8	2	1	45