On Sat, 30 May 2009, Sebastián Goinheix wrote:

> OK. This is a great project, and I'm very happy to participate (at least in
> the user list).
> Thank you very much.
>
> 2009/5/30 Allin Cottrell <cottrell(a)wfu.edu>
> > I'm in transit right now, but will try to offer an answer before
> > long.

Meanwhile Jack Lucchetti has posted a possible solution.  But I'll
go ahead and give mine too -- it's more complicated but I think
it may be more general.

I'm supposing you have a data set that is structurally similar to
this simple hypothetical example:

hhid y x
10004 100 1
10004 110 4
24532 90 4
24532 120 4
24532 100 2
39800 150 5
46541 100 4
46541 80 3
46541 90 6

where "hhid" records the household identifier for various
individuals, and y and x are the variables of interest.  I'm
assuming you want to consolidate the data by household, either by
summing the values or possibly taking a household average.  Here's
my solution:

<script>
# Suppose the above data are in hh.txt
open hh.txt
scalar n = $nobs

# how many households are there?
matrix hhvals = values(hhid)
scalar nhh = rows(hhvals)
printf "Found %d households\n", nhh

# how many variables are there? (excluding the constant)
scalar nv = $nvars - 1
printf "We have %d variables\n", nv
# create a matrix to hold the household data (with an extra
# column for the number of members)
matrix X = zeros(nhh, nv + 1)
# create list of variables (excluding hhid)
list vars = dataset
vars -= hhid

# scalars for accounting
scalar j, Xrow, Xcol

# form household-level variables in matrix X: here I'm just
# summing the values for the members of the household
loop i=1..n --quiet
   loop j=1..nhh --quiet
      if hhid[i] == hhvals[j]
         printf "obs %d belongs to household %d\n", i, hhvals[j]
         Xrow = j
         break
      endif
   endloop
   # column 1 holds the household ID
   X[Xrow,1] = hhid[i]
   Xcol = 2
   loop foreach k vars --quiet
      X[Xrow, Xcol] += $k[i]
      Xcol++
   endloop
   # in the last column of X, cumulate the number of members
   # in the given household
   X[Xrow,Xcol] += 1
endloop

# print HH data in matrix form to check
print X

# replace original dataset with household version (one could
# form household means here, if wanted)
loop i=1..nhh --quiet
  hhid[i] = X[i,1]
  Xcol = 2
  loop foreach k vars --quiet
     $k[i] = X[i, Xcol]
     Xcol++
  endloop
endloop

# restrict the sample to the number of households and save
smpl 1 nhh
series nmembers = X[, nv+1]
setinfo nmembers -d "Number of people in household"
print --byobs
store hh2.gdt
</script>

The outline is that we take the original data, cumulate it into a
matrix, then use the matrix to overwrite the first nhh rows of the
original dataset, then finally chop off the unwanted rows with
"smpl" and save under a new name.  The household IDs don't have to
be consecutive, or 1-based, and the rows do not have to be
organized by household.
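
If household means are wanted rather than sums, the following is a
minimal sketch of a follow-up step, not a tested recipe.  It assumes
the script above has already been run, so that hh2.gdt exists in the
working directory and holds hhid, the summed variables and nmembers.
The "m_" prefix for the new series is just an illustrative choice.

<script>
# Assumes hh2.gdt was written by the script above
open hh2.gdt

# list of the summed variables (excluding hhid and the member count)
list vars = dataset
vars -= hhid
vars -= nmembers

# household mean = household sum / number of members;
# the "m_" prefix is an arbitrary naming choice
loop foreach k vars --quiet
   series m_$k = $k / nmembers
endloop

print --byobs
</script>

With the example data, household 10004 has sums y = 210 and x = 5
over 2 members, so m_y should come out as 105 and m_x as 2.5.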

Allin Cottrell




