Hello,

I have a dataset with about 50,000 records and 30 variables. Two of the 
variables are categorical with many categories: federal state (16 
categories) and branch of economic activity (80-100 categories). Since I 
want to produce a synthetic dataset, I double the dataset and replace all 
values of one variable with missing values in the copy.
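
The doubling step described above could be sketched as follows. This is only an illustration under assumptions: it supposes the blanked copy is appended below the original records, and the field names and the use of `None` as the missing-value marker are hypothetical, not IVEware's input format.

```python
def double_for_synthesis(records, target):
    """Stack the original records on top of a copy in which `target` is missing."""
    blanked = []
    for rec in records:
        clone = dict(rec)        # shallow copy, so the original record is untouched
        clone[target] = None     # None stands in for the missing-value code (assumption)
        blanked.append(clone)
    return list(records) + blanked


# Hypothetical toy records; the real data has about 50,000 rows and 30 variables.
records = [
    {"state": 6, "branch": 47, "employees": 12},
    {"state": 9, "branch": 10, "employees": 250},
]
stacked = double_for_synthesis(records, "state")
```

The imputation model is then fitted on the observed half, and the imputed values in the blanked half form the synthetic values for that variable.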

Now to my problem with IVEware: if I try to impute, for example, the 
federal state, the first iteration is still running after 5-6 hours, which 
is too long.
My second attempt: I compute dummies for the 16 federal states. I first 
impute the state with the most units, then the one with the second most 
units, and so on. On the whole this works well, but for the last state only 
20-30 units remain (original data: 358 units). I tried swapping the order 
of the smallest and the second smallest state, but this didn't solve the 
problem: now the second smallest state has far too few units in the 
synthetic dataset. Does anyone have further suggestions for handling 
categorical variables with many categories in IVEware?


Kind regards
Hans-Peter Hafner

STATISTIK HESSEN

-----------
Hessisches Statistisches Landesamt
Rheinstraße 35/37
65175 Wiesbaden
Internet: http://www.statistik-hessen.de

Telefon: 0611 3802-815
Telefax: 0611 3802-890
E-Mail: [email protected]
From newgardc <@t> ohsu.edu  Mon Sep 21 22:21:10 2009
From: newgardc <@t> ohsu.edu (Craig Newgard)
Date: Tue Sep 22 14:05:55 2009
Subject: [Impute] IVEware: Imputation of categorical variables with many
        categories

Hans-Peter,
I'm not sure whether you've had a response to your question below yet, but I 
have been through similar scenarios with IVEware before. My suggestion would be 
to keep the original (polytomous) categorical variables in the MI code: this 
lets IVEware create the dummies itself while still recognizing that the 
categories are mutually exclusive. If you create your own dummies, you run the 
risk of imputing positive values for more than one dummy on the same 
observation.

A few other things you could try to improve efficiency: make sure the largest 
category is coded with the highest number in the term, since IVEware uses the 
highest code as the reference by default, and examine the small categories 
within the term, collapsing them where appropriate. Even so, I've found that 
many IVEware MI models routinely require 24+ hours to run with such terms 
included.

If these suggestions still fail to improve the MI efficiency, you could also 
consider running parallel chains of MI, using separate MI models for the 
subjects within each category of the original term. This approach is commonly 
used for interaction terms and works best if each category has an adequate 
number of observations and minimal missing data.
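
The recoding and collapsing suggestions could be prepared before the data are handed to IVEware, for example along the lines of the sketch below. It is plain Python, not IVEware syntax. The function name, the `min_count` threshold, and the `"OTHER"` label are hypothetical, and whether collapsing is acceptable at all depends on the analysis the synthetic data are meant to support.

```python
from collections import Counter


def recode_by_frequency(values, min_count=0, other_label="OTHER"):
    """Collapse rare categories, then number categories so the largest gets the highest code."""
    counts = Counter(values)
    # Categories with fewer than min_count observations are merged into one label.
    collapsed = [v if counts[v] >= min_count else other_label for v in values]
    new_counts = Counter(collapsed)
    # Ascending frequency (ties broken by label), so the largest category comes last.
    ordered = sorted(new_counts, key=lambda c: (new_counts[c], str(c)))
    mapping = {cat: code for code, cat in enumerate(ordered, start=1)}
    return [mapping[v] for v in collapsed], mapping


# Hypothetical toy data: two states with a single unit each are collapsed.
states = ["HE"] * 5 + ["BY"] * 3 + ["HB"] + ["SL"]
codes, mapping = recode_by_frequency(states, min_count=2)
```

Here the largest category ends up with the highest code, which the message above describes as IVEware's default reference. Keeping `mapping` makes it possible to translate the imputed codes back to the original labels afterwards.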

Craig
________________________________________
From: [email protected] 
[[email protected]] On Behalf Of 
[email protected] [[email protected]]
Sent: Monday, September 21, 2009 1:45 AM
To: [email protected]
Subject: [Impute] IVEware: Imputation of categorical variables with many        
categories

