You are exactly correct that this is an important distinction. But that wasn't really the problem. In the problem as stated, the numbers were word counts, which are quite plausibly zero in the absence of an observation. For rating data, the situation is very different.
Even with counts where no data == 0, there is some question about the proper way to handle the zeros. My own feeling is that the best interpretation is "not observed yet". Under a proper probabilistic interpretation, a zero after many observations is a real constraint, while a zero after very few observations means little.

Again, with ratings data, the problem is different. There it is entirely appropriate to treat no data as giving us little or no information, unless the ratings are chosen by the user, in which case the absence of a rating is actually (slightly) informative. How you combine both forms of data is another question entirely.

On Thu, May 28, 2009 at 12:07 AM, Sean Owen <[email protected]> wrote:
> So there is no way to distinguish between 0 and "no value" --
> but conceptually those two are quite different things. Am I right
> about this? I think the API would have to change then. I tripped on a
> very similar problem early on in implementing cosine-measure / Pearson
> correlation, for exactly the same reason. (Or perhaps I am just
> projecting my past problem/solution here.)
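
As a rough illustration of the "not observed yet" reading of a zero count, here is a minimal Python sketch. It is not from the thread or from any existing API: the function name `zero_count_upper_bound` and the uniform Beta(1, 1) prior on a word's occurrence rate are assumptions chosen only to make the point concrete. Having seen a word zero times in n observations, the posterior on its rate is Beta(1, n + 1), and the sketch returns the upper credible bound on that rate.

```python
def zero_count_upper_bound(n_observations, credibility=0.95):
    """Upper credible bound on a rate p after 0 occurrences in n observations.

    Assumption (hypothetical, for illustration): a uniform Beta(1, 1) prior
    on p. The posterior is then Beta(1, n + 1), whose CDF is
    1 - (1 - p) ** (n + 1). Solving CDF(p) = credibility for p gives the bound.
    """
    if n_observations < 0:
        raise ValueError("n_observations must be non-negative")
    if not 0.0 < credibility < 1.0:
        raise ValueError("credibility must be strictly between 0 and 1")
    return 1.0 - (1.0 - credibility) ** (1.0 / (n_observations + 1))


if __name__ == "__main__":
    # Compare how much a zero tells us after few versus many observations.
    for n in (0, 10, 1000, 10000):
        print(n, zero_count_upper_bound(n))
```

Under these assumptions, a zero after 10 observations still leaves the rate plausibly above 0.2, so it says very little, while a zero after 10,000 observations pins the rate below 0.001, which is a real constraint. A missing rating has no comparable observation count behind it, which is one way to see why the two kinds of data need different handling.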
