Yeah I agree that in many cases "no value" and 0 will be equivalent. I
probably misunderstand the problem domain, but in computing centroids
of points, it does seem that 0 and "dunno" are different. The average
of 0, 1 and 3 is 1.333, while the average of "dunno", 1, and 3 is
"dunno".
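
To make the arithmetic concrete, here is a minimal sketch, not anything from the actual API. The name mean_or_unknown is made up for illustration, and None stands in for "dunno":

```python
from typing import Optional, Sequence


def mean_or_unknown(values: Sequence[Optional[float]]) -> Optional[float]:
    """Average that propagates "dunno" (None) instead of treating it as 0.

    Hypothetical helper for illustration only.
    """
    # An empty input or any unknown component makes the whole average unknown.
    if not values or any(v is None for v in values):
        return None
    return sum(values) / len(values)


print(mean_or_unknown([0, 1, 3]))     # average of three known values
print(mean_or_unknown([None, 1, 3]))  # None: one component is unknown
```

Treating None as 0 would silently turn the second case into the first.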

Perhaps the question still stands, though: shouldn't the API allow
the distinction, whether or not this particular problem needs it?
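
One way such an API could look, purely as a sketch: the class and method names below are hypothetical and not taken from any existing code. The idea is that callers who want "absent means 0" ask for it explicitly, while the default accessor keeps the two cases apart:

```python
from typing import Dict, Optional


class SparseValues:
    """Hypothetical sparse vector that keeps "no value" distinct from 0."""

    def __init__(self) -> None:
        # Only indices that were explicitly set are stored.
        self._values: Dict[int, float] = {}

    def set(self, index: int, value: float) -> None:
        self._values[index] = value

    def get(self, index: int) -> Optional[float]:
        # None means "no value"; 0.0 means an explicitly stored zero.
        return self._values.get(index)

    def get_or_zero(self, index: int) -> float:
        # Opt-in convenience for domains (e.g. word counts) where absent == 0.
        value = self._values.get(index)
        return 0.0 if value is None else value


v = SparseValues()
v.set(0, 0.0)
print(v.get(0))          # explicitly stored zero
print(v.get(1))          # None: never set
print(v.get_or_zero(1))  # caller chose to read absent as zero
```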

On Thu, May 28, 2009 at 8:18 AM, Ted Dunning <[email protected]> wrote:
> You are exactly correct that this is an important distinction.  But that
> wasn't really the problem.  In the problem as stated, the numbers were word
> counts which are quite plausibly zero in the absence of an observation.  For
> rating data, there is a very different situation.
>
> Even with counts where no data == 0, there is some question about the proper
> way to handle the zeros.  My own feeling is that the best interpretation is
> "not observed yet".  If you use a proper probabilistic interpretation, then
> if you have made many observations and still have a zero, then you have a
> constraint while if you have made very few observations and have a zero, it
> means little.  Again, with ratings data, the problem is different.  There it
> is entirely appropriate to treat no data as something that gives us little
> or no information unless the ratings are chosen by the user in which case no
> rating is actually (slightly) informative.
>
> How you combine both forms of data is another question entirely.
>
> On Thu, May 28, 2009 at 12:07 AM, Sean Owen <[email protected]> wrote:
>
>> So there is no way to distinguish between 0 and "no value" --
>> but conceptually those two are quite different things. Am I right
>> about this? I think the API would have to change then. I tripped on a
>> very similar problem early on in implementing cosine-measure / Pearson
>> correlation, for exactly the same reason. (Or perhaps I am just
>> projecting my past problem/solution here.)
>>
>
