Good question Gert. Good solution, Igor and I like Keith's formatting.

I thought the list might be interested in some of the statistical issues
involved in determining if this method of replacing null values is an
appropriate method for your data analysis and alternatives that are
available.

The statistical term for replacing "missing values" (a type of null value)
with a computed or selected value is  "imputation."

This problem/solution presented on this list is an implementation of a type
of mean imputation. The statistical language R, has an entire package
devoted to imputation (although ironically, it doesn't have this exact
method -- it calculates the mean of an entire column without grouping  or
performs a more complex analysis. Although that may be because R experts
know a way to add the grouping.).
http://cran.r-project.org/web/packages/imputation/imputation.pdf

The Wikipedia article, "Imputation (statistics)"
http://en.wikipedia.org/wiki/Imputation_(statistics)
points out some of the tradeoffs involved:

"Another imputation technique involves replacing any missing value with the
mean of that variable for all other cases [records], which has the benefit
of not changing the sample mean for that variable. However, mean imputation
attenuates any correlations involving the variable(s) that are imputed.
This is because, in cases with imputation, there is guaranteed to be no
relationship between the imputed variable and any other measured variables.
Thus, mean imputation has some attractive properties for univariate
analysis but becomes problematic for multivariate analysis.

Regression imputation has the opposite problem of mean imputation. A
regression model is estimated to predict observed values of a variable
based on other variables, and that model is then used to impute values in
cases where that variable is missing. In other words, available information
for complete and incomplete cases is used to predict whether a value on a
specific variable is missing or not. Fitted values from the regression
model are then used to impute the missing values. The problem is that the
imputed data do not have an error term included in their estimation, thus
the estimates fit perfectly along the regression line without any residual
variance. This causes relationships to be over identified and suggest
greater precision in the imputed values than is warranted. The regression
model predicts the most likely value of missing data but does not supply
uncertainty about that value."

There is a lot more in the Wikipedia article, but this seemed like the most
relevant section.

HTH,
Jim Callahan


On Sat, Nov 2, 2013 at 2:23 PM, Igor Tandetnik <i...@tandetnik.org> wrote:

> On 11/2/2013 1:06 PM, Gert Van Assche wrote:
>
>> All, I have this table:
>>
>> DROP TABLE T;
>> CREATE TABLE T (N, V, G);
>> INSERT INTO T VALUES('a', 1, 'x');
>> INSERT INTO T VALUES('b', 3, 'x');
>> INSERT INTO T VALUES('c', null, 'x');
>> INSERT INTO T VALUES('d', 80, 'y');
>> INSERT INTO T VALUES('e', null, 'y');
>> INSERT INTO T VALUES('f', 60, 'y');
>> INSERT INTO T VALUES('g', null, 'y');
>> INSERT INTO T VALUES('h', null, 'z');
>> INSERT INTO T VALUES('i', 111, 'z');
>>
>> I would like to see where N='c', V as the average for the group (G) were
>> this record belongs to (so 'x').
>>
>> Thus where N='c' I would get 2, and where N='e' or 'g', it would be 70,
>> and
>> where N=h it would be 111.
>>
>
> I'm not sure I quite follow, but something like this perhaps:
>
> update T set V = (select avg(V) from T t2 where T.G = t2.G)
> where V is null;
>
> --
> Igor Tandetnik
>
>
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to