Good question Gert. Good solution, Igor and I like Keith's formatting. I thought the list might be interested in some of the statistical issues involved in determining if this method of replacing null values is an appropriate method for your data analysis and alternatives that are available.
The statistical term for replacing "missing values" (a type of null value) with a computed or selected value is "imputation." This problem/solution presented on this list is an implementation of a type of mean imputation. The statistical language R, has an entire package devoted to imputation (although ironically, it doesn't have this exact method -- it calculates the mean of an entire column without grouping or performs a more complex analysis. Although that may be because R experts know a way to add the grouping.). http://cran.r-project.org/web/packages/imputation/imputation.pdf The Wikipedia article, "Imputation (statistics)" http://en.wikipedia.org/wiki/Imputation_(statistics) points out some of the tradeoffs involved: "Another imputation technique involves replacing any missing value with the mean of that variable for all other cases [records], which has the benefit of not changing the sample mean for that variable. However, mean imputation attenuates any correlations involving the variable(s) that are imputed. This is because, in cases with imputation, there is guaranteed to be no relationship between the imputed variable and any other measured variables. Thus, mean imputation has some attractive properties for univariate analysis but becomes problematic for multivariate analysis. Regression imputation has the opposite problem of mean imputation. A regression model is estimated to predict observed values of a variable based on other variables, and that model is then used to impute values in cases where that variable is missing. In other words, available information for complete and incomplete cases is used to predict whether a value on a specific variable is missing or not. Fitted values from the regression model are then used to impute the missing values. The problem is that the imputed data do not have an error term included in their estimation, thus the estimates fit perfectly along the regression line without any residual variance. This causes relationships to be over identified and suggest greater precision in the imputed values than is warranted. The regression model predicts the most likely value of missing data but does not supply uncertainty about that value." There is a lot more in the Wikipedia article, but this seemed like the most relevant section. HTH, Jim Callahan On Sat, Nov 2, 2013 at 2:23 PM, Igor Tandetnik <i...@tandetnik.org> wrote: > On 11/2/2013 1:06 PM, Gert Van Assche wrote: > >> All, I have this table: >> >> DROP TABLE T; >> CREATE TABLE T (N, V, G); >> INSERT INTO T VALUES('a', 1, 'x'); >> INSERT INTO T VALUES('b', 3, 'x'); >> INSERT INTO T VALUES('c', null, 'x'); >> INSERT INTO T VALUES('d', 80, 'y'); >> INSERT INTO T VALUES('e', null, 'y'); >> INSERT INTO T VALUES('f', 60, 'y'); >> INSERT INTO T VALUES('g', null, 'y'); >> INSERT INTO T VALUES('h', null, 'z'); >> INSERT INTO T VALUES('i', 111, 'z'); >> >> I would like to see where N='c', V as the average for the group (G) were >> this record belongs to (so 'x'). >> >> Thus where N='c' I would get 2, and where N='e' or 'g', it would be 70, >> and >> where N=h it would be 111. >> > > I'm not sure I quite follow, but something like this perhaps: > > update T set V = (select avg(V) from T t2 where T.G = t2.G) > where V is null; > > -- > Igor Tandetnik > > > _______________________________________________ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users