Yeah, I agree that in many cases "no value" and 0 will be equivalent. I probably misunderstand the problem domain, but when computing centroids of points, it does seem that 0 and "dunno" are different. The average of 0, 1, and 3 is about 1.333, while the average of "dunno", 1, and 3 is "dunno".
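
To make the arithmetic concrete, here is a rough sketch of the two behaviours. It is plain Python, not the API under discussion: the function names are made up for illustration, and using None to stand for "dunno" is my assumption.

```python
from typing import Optional, Sequence

# Assumption for illustration: None stands for "dunno" (no observation).
Value = Optional[float]


def mean_missing_as_zero(values: Sequence[Value]) -> Optional[float]:
    """Average that cannot tell "no value" from 0: None is counted as 0."""
    if not values:
        return None
    return sum(0.0 if v is None else v for v in values) / len(values)


def mean_propagating_missing(values: Sequence[Value]) -> Optional[float]:
    """Average that keeps the distinction: any None makes the result None."""
    if not values or any(v is None for v in values):
        return None
    return sum(values) / len(values)


print(mean_missing_as_zero([0, 1, 3]))            # about 1.333
print(mean_missing_as_zero([None, 1, 3]))         # also about 1.333
print(mean_propagating_missing([None, 1, 3]))     # None, i.e. "dunno"
```

The first function is what you get when the representation collapses "no value" into 0. The second is only possible if the representation can carry the distinction in the first place.
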

Perhaps the question still stands, though: shouldn't the API allow the distinction, whether or not this particular problem needs it?

On Thu, May 28, 2009 at 8:18 AM, Ted Dunning <[email protected]> wrote:
> You are exactly correct that this is an important distinction. But that
> wasn't really the problem. In the problem as stated, the numbers were word
> counts which are quite plausibly zero in the absence of an observation. For
> rating data, there is a very different situation.
>
> Even with counts where no data == 0, there is some question about the proper
> way to handle the zeros. My own feeling is that the best interpretation is
> "not observed yet". If you use a proper probabilistic interpretation, then
> if you have made many observations and still have a zero, then you have a
> constraint while if you have made very few observations and have a zero, it
> means little. Again, with ratings data, the problem is different. There it
> is entirely appropriate to treat no data as something that gives us little
> or no information unless the ratings are chosen by the user in which case no
> rating is actually (slightly) informative.
>
> How you combine both forms of data is another question entirely.
>
> On Thu, May 28, 2009 at 12:07 AM, Sean Owen <[email protected]> wrote:
>
>> So there is no way to distinguish between 0 and "no value" --
>> but conceptually those two are quite different things. Am I right
>> about this? I think the API would have to change then. I tripped on a
>> very similar problem early on in implementing cosine-measure / Pearson
>> correlation, for exactly the same reason. (Or perhaps I am just
>> projecting my past problem/solution here.)
>>
>
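
On the quoted point that a zero after many observations is a constraint while a zero after few means little, one possible way to put numbers on it is sketched below. The model is entirely my assumption and not something stated in the thread: each observation is an independent yes/no trial with a uniform prior on the underlying rate, so after n observations with zero hits the posterior is Beta(1, n + 1). The function name is hypothetical.

```python
def zero_rate_upper_bound(n_observations: int, confidence: float = 0.95) -> float:
    """Upper credible bound on a rate after seeing zero hits in n observations.

    Assumes a uniform prior, so the posterior is Beta(1, n + 1) with
    CDF F(p) = 1 - (1 - p) ** (n + 1). Solving F(p) = confidence for p
    gives the bound returned here.
    """
    if n_observations < 0:
        raise ValueError("n_observations must be non-negative")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / (n_observations + 1))


# A zero after very few observations barely constrains the rate ...
print(zero_rate_upper_bound(3))
# ... while a zero after many observations pins it close to 0.
print(zero_rate_upper_bound(1000))
```

Under these assumptions the bound shrinks roughly like 3/n for large n, which matches the intuition that the same observed zero carries more information the more data sits behind it.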
